Spark's Tips and Tricks
jaceklaskowski committed Sep 14, 2024
1 parent e5ccbaf commit ac57607
Showing 10 changed files with 192 additions and 204 deletions.
135 changes: 0 additions & 135 deletions docs/spark-tips-and-tricks-running-spark-windows.md

This file was deleted.

4 changes: 4 additions & 0 deletions docs/tips-and-tricks/.pages
@@ -0,0 +1,4 @@
title: Spark's Tips and Tricks
nav:
- index.md
- ...
@@ -1,12 +1,12 @@
== Access private members in Scala in Spark shell
# Access private members in Scala in Spark shell

If you ever wanted to use `private[spark]` members in Spark using the Scala programming language, e.g. toy with `org.apache.spark.scheduler.DAGScheduler` or similar, you will have to use the following trick in Spark shell - use `:paste -raw` as described in https://issues.scala-lang.org/browse/SI-5299[REPL: support for package definition].
If you ever want to use `private[spark]` members in Spark using the Scala programming language, e.g. to toy with `org.apache.spark.scheduler.DAGScheduler` or similar, you have to use the following trick in Spark shell: `:paste -raw`, as described in [REPL: support for package definition](https://issues.scala-lang.org/browse/SI-5299).

Open `spark-shell` and execute `:paste -raw`, which allows you to enter any valid Scala code, including `package` definitions.

The following snippet shows how to access `private[spark]` member `DAGScheduler.RESUBMIT_TIMEOUT`:

```
```text
scala> :paste -raw
// Entering paste mode (ctrl-D to finish)
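
// What follows is a hedged sketch of the kind of code to paste here; the wrapper
// object name is illustrative, and it assumes DAGScheduler's companion object
// defines the private[spark] member RESUBMIT_TIMEOUT.

package org.apache.spark.scheduler

object PrivateAccess {
  // visible only because this code itself lives under the org.apache.spark package
  def resubmitTimeout = DAGScheduler.RESUBMIT_TIMEOUT
}

// Exiting paste mode, now interpreting.

scala> org.apache.spark.scheduler.PrivateAccess.resubmitTimeout
```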
23 changes: 9 additions & 14 deletions docs/spark-tips-and-tricks.md β†’ docs/tips-and-tricks/index.md
@@ -1,39 +1,34 @@
= Spark Tips and Tricks
# Spark's Tips and Tricks

== [[SPARK_PRINT_LAUNCH_COMMAND]] Print Launch Command of Spark Scripts
## Print Launch Command of Spark Scripts { #SPARK_PRINT_LAUNCH_COMMAND }

`SPARK_PRINT_LAUNCH_COMMAND` environment variable controls whether the Spark launch command is printed out to the standard error output, i.e. `System.err`, or not.

```
Spark Command: [here comes the command]
========================================
```
The `SPARK_PRINT_LAUNCH_COMMAND` environment variable controls whether the Spark launch command is printed out to the standard error output (`System.err`).

All the Spark shell scripts use the `org.apache.spark.launcher.Main` class internally, which checks `SPARK_PRINT_LAUNCH_COMMAND` and, when it is set (to any value), prints out the entire command line used to launch it.

```
```text
$ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell
Spark Command: /Library/Java/JavaVirtualMachines/Current/Contents/Home/bin/java -cp /Users/jacek/dev/oss/spark/conf/:/Users/jacek/dev/oss/spark/assembly/target/scala-2.11/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar -Dscala.usejavacp=true -Xms1g -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://localhost:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell
========================================
```

== Show Spark version in Spark shell
## Show Spark version in Spark shell

In spark-shell, use `sc.version` or `org.apache.spark.SPARK_VERSION` to find out the Spark version:

```
```text
scala> sc.version
res0: String = 1.6.0-SNAPSHOT
scala> org.apache.spark.SPARK_VERSION
res1: String = 1.6.0-SNAPSHOT
```

== Resolving local host name
## Resolving local host name

When you face networking issues because Spark can't resolve your local hostname or IP address, set the preferred `SPARK_LOCAL_HOSTNAME` environment variable to a custom hostname, or `SPARK_LOCAL_IP` to a custom IP address that is later resolved to a hostname.

Spark checks them out before using http://docs.oracle.com/javase/8/docs/api/java/net/InetAddress.html#getLocalHost--[java.net.InetAddress.getLocalHost()] (consult https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L759[org.apache.spark.util.Utils.findLocalInetAddress()] method).
Spark checks them out before using [java.net.InetAddress.getLocalHost()](http://docs.oracle.com/javase/8/docs/api/java/net/InetAddress.html#getLocalHost--) (consult the [org.apache.spark.util.Utils.findLocalInetAddress()](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L759) method).
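
For example (a minimal sketch; the hostname and IP values are placeholders, and you set one or the other before starting a Spark application):

```text
SPARK_LOCAL_HOSTNAME=localhost ./bin/spark-shell
SPARK_LOCAL_IP=127.0.0.1 ./bin/spark-shell
```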

You may see the following WARN messages in the logs when Spark has finished the resolution process:

@@ -44,7 +39,7 @@ Set SPARK_LOCAL_IP if you need to bind to another address

## Starting standalone Master and workers on Windows 7

Windows 7 users can use [spark-class](tools/spark-class.md) to start Spark Standalone as there are no launch scripts for the Windows platform.
Windows 7 users can use [spark-class](../tools/spark-class.md) to start Spark Standalone as there are no launch scripts for the Windows platform.

```text
./bin/spark-class org.apache.spark.deploy.master.Master -h localhost
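
# a hedged sketch: a worker can be started the same way
# (the master URL below assumes the "-h localhost" master started above)
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077
```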
132 changes: 132 additions & 0 deletions docs/tips-and-tricks/running-spark-windows.md
@@ -0,0 +1,132 @@
# Running Spark Applications on Windows

Running Spark applications on Windows is in general no different from running them on other operating systems like Linux or macOS.

!!! note
A Spark application could be [spark-shell](../tools/spark-shell.md) or your own custom Spark application.

What does make an important difference between the operating systems is Apache Hadoop, which Spark uses internally for file system access.

You may run into a few minor issues on Windows due to the way Hadoop works with Windows' POSIX-incompatible NTFS filesystem.

!!! note
You are not required to install Apache Hadoop to develop or run Spark applications.

!!! tip
Read the Apache Hadoop project's [Problems running Hadoop on Windows](https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems).

Among the issues is the infamous `java.io.IOException` while running Spark Shell (below is a stacktrace from Spark 2.0.2 on Windows 10, so the line numbers may be different in your case).

```text
16/12/26 21:34:11 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
at org.apache.hadoop.hive.conf.HiveConf$ConfVars.findHadoopBinary(HiveConf.java:2327)
at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:365)
at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:228)
at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:963)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:91)
```

!!! note
You need to have Administrator rights on your laptop.
All the following commands must be executed in a command-line window (`cmd`) run as Administrator, i.e. started using the **Run as administrator** option.

Download the `winutils.exe` binary from the [steveloughran/winutils](https://github.com/steveloughran/winutils) GitHub repository.

!!! note
Select the version of Hadoop the Spark distribution was compiled with, e.g. use `hadoop-2.7.1` for Spark 2 ([here is the direct link to `winutils.exe` binary](https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe)).

Save the `winutils.exe` binary to a directory of your choice (e.g., `c:\hadoop\bin`).

Set `HADOOP_HOME` to the directory above `bin` with `winutils.exe` (i.e., `c:\hadoop`, not `c:\hadoop\bin`).

```text
set HADOOP_HOME=c:\hadoop
```

Set the `PATH` environment variable to include `%HADOOP_HOME%\bin` as follows:

```text
set PATH=%HADOOP_HOME%\bin;%PATH%
```

!!! tip
Define the `HADOOP_HOME` and `PATH` environment variables in the Control Panel so that any Windows program can use them.
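
For example, `setx` persists a user-level environment variable so that newly opened `cmd` windows pick it up (a sketch; the current window is not affected, and `PATH` is usually safer to extend via the Control Panel dialog):

```text
setx HADOOP_HOME c:\hadoop
```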

Create the `C:\tmp\hive` directory.

!!! note
The `c:\tmp\hive` directory is the default value of the [`hive.exec.scratchdir` configuration property](https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.scratchdir) in Hive 0.14.0 and later (Spark uses a custom build of Hive 1.2.1).

You can change the `hive.exec.scratchdir` configuration property to point to another directory, as described in [Changing `hive.exec.scratchdir` Configuration Property](#changing-hive.exec.scratchdir) in this document.
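
For example, in the same elevated `cmd` session (a sketch):

```text
mkdir C:\tmp\hive
```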

Execute the following command in the `cmd` window that you started with the **Run as administrator** option.

```text
winutils.exe chmod -R 777 C:\tmp\hive
```

Check the permissions (this is one of the commands that are executed under the covers):

```text
winutils.exe ls -F C:\tmp\hive
```

Open `spark-shell` and observe the output (perhaps with a few WARN messages that you can simply disregard).

As a verification step, execute the following line to display the content of a `DataFrame`:

```text
scala> spark.range(1).withColumn("status", lit("All seems fine. Congratulations!")).show(false)
+---+--------------------------------+
|id |status |
+---+--------------------------------+
|0 |All seems fine. Congratulations!|
+---+--------------------------------+
```

!!! note
Disregard WARN messages when you start `spark-shell`. They are harmless.

```text
16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of
the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar" is already registered,
and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-core-
3.2.10.jar."
16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR
versions of the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar" is already
registered, and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-
hadoop2.7/bin/../jars/datanucleus-api-jdo-3.2.6.jar."
16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR
versions of the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-rdbms-3.2.9.jar" is
already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-
hadoop2.7/jars/datanucleus-rdbms-3.2.9.jar."
```

If you see the above output, you're done! You should now be able to run Spark applications on your Windows machine. Congrats! πŸ‘πŸ‘πŸ‘

## Changing hive.exec.scratchdir { #changing-hive.exec.scratchdir }

Create a `hive-site.xml` file with the following content:

```xml
<configuration>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/mydir</value>
<description>Scratch space for Hive jobs</description>
</property>
</configuration>
```

Start a Spark application (e.g., `spark-shell`) with the `HADOOP_CONF_DIR` environment variable set to the directory that contains `hive-site.xml`.

```text
HADOOP_CONF_DIR=conf ./bin/spark-shell
```
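
On Windows `cmd`, the equivalent would be along these lines (a sketch; it assumes `hive-site.xml` sits in a local `conf` directory):

```text
set HADOOP_CONF_DIR=conf
bin\spark-shell
```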
@@ -1,8 +1,8 @@
== org.apache.spark.SparkException: Task not serializable
# org.apache.spark.SparkException: Task not serializable

When you run into the `org.apache.spark.SparkException: Task not serializable` exception, it means that you are using a reference to an instance of a non-serializable class inside a transformation. See the following example:

```
```text
➜ spark git:(master) βœ— ./bin/spark-shell
Welcome to
____ __
@@ -68,8 +68,8 @@ Serialization stack:
... 57 more
```
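
A minimal sketch of the pattern that triggers the exception and one common fix (the class and value names are illustrative):

```text
scala> class Multiplier { val factor = 2 }   // not Serializable
scala> val m = new Multiplier
scala> sc.parallelize(1 to 3).map(_ * m.factor).count
org.apache.spark.SparkException: Task not serializable
...

scala> // one fix: make the captured class Serializable
scala> class Multiplier2 extends Serializable { val factor = 2 }
scala> val m2 = new Multiplier2
scala> sc.parallelize(1 to 3).map(_ * m2.factor).count
res1: Long = 3
```

Another common workaround is to copy the needed value into a local `val` (e.g. `val f = m.factor`) and reference only that local inside the closure.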

=== Further reading
## Learn More

* https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html[Job aborted due to stage failure: Task not serializable]
* https://issues.apache.org/jira/browse/SPARK-5307[Add utility to help with NotSerializableException debugging]
* http://stackoverflow.com/q/22592811/1305344[Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects]
* [Job aborted due to stage failure: Task not serializable](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html)
* [Add utility to help with NotSerializableException debugging](https://issues.apache.org/jira/browse/SPARK-5307)
* [Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects](http://stackoverflow.com/q/22592811/1305344)