Spark's Tips and Tricks
jaceklaskowski committed Sep 14, 2024
1 parent e5ccbaf commit ac57607
Showing 10 changed files with 192 additions and 204 deletions.
135 changes: 0 additions & 135 deletions docs/spark-tips-and-tricks-running-spark-windows.md

This file was deleted.

4 changes: 4 additions & 0 deletions docs/tips-and-tricks/.pages
@@ -0,0 +1,4 @@
title: Spark's Tips and Tricks
nav:
- index.md
- ...
@@ -1,12 +1,12 @@
== Access private members in Scala in Spark shell
# Access private members in Scala in Spark shell

If you ever wanted to use `private[spark]` members in Spark using the Scala programming language, e.g. toy with `org.apache.spark.scheduler.DAGScheduler` or similar, you will have to use the following trick in Spark shell - use `:paste -raw` as described in https://issues.scala-lang.org/browse/SI-5299[REPL: support for package definition].
If you ever want to use `private[spark]` members in Spark using the Scala programming language, e.g. to toy with `org.apache.spark.scheduler.DAGScheduler` or similar, you have to use the following trick in Spark shell: `:paste -raw`, as described in [REPL: support for package definition](https://issues.scala-lang.org/browse/SI-5299).

Open `spark-shell` and execute `:paste -raw`, which allows you to enter any valid Scala code, including `package` definitions.

The following snippet shows how to access `private[spark]` member `DAGScheduler.RESUBMIT_TIMEOUT`:

```
```text
scala> :paste -raw
// Entering paste mode (ctrl-D to finish)
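
// What follows is a hedged sketch of the kind of code to paste here; the wrapper
// object name is illustrative, and it assumes DAGScheduler's companion object
// defines the private[spark] member RESUBMIT_TIMEOUT.

package org.apache.spark.scheduler

object PrivateAccess {
  // visible only because this code itself lives under the org.apache.spark package
  def resubmitTimeout = DAGScheduler.RESUBMIT_TIMEOUT
}

// Exiting paste mode, now interpreting.

scala> org.apache.spark.scheduler.PrivateAccess.resubmitTimeout
```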
23 changes: 9 additions & 14 deletions docs/spark-tips-and-tricks.md β†’ docs/tips-and-tricks/index.md
@@ -1,39 +1,34 @@
= Spark Tips and Tricks
# Spark's Tips and Tricks

== [[SPARK_PRINT_LAUNCH_COMMAND]] Print Launch Command of Spark Scripts
## Print Launch Command of Spark Scripts { #SPARK_PRINT_LAUNCH_COMMAND }

`SPARK_PRINT_LAUNCH_COMMAND` environment variable controls whether the Spark launch command is printed out to the standard error output, i.e. `System.err`, or not.

```
Spark Command: [here comes the command]
========================================
```
The `SPARK_PRINT_LAUNCH_COMMAND` environment variable controls whether the Spark launch command is printed out to the standard error output (`System.err`).

All the Spark shell scripts use the `org.apache.spark.launcher.Main` class internally, which checks `SPARK_PRINT_LAUNCH_COMMAND` and, when it is set (to any value), prints out the entire command line used to launch it.

```
```text
$ SPARK_PRINT_LAUNCH_COMMAND=1 ./bin/spark-shell
Spark Command: /Library/Java/JavaVirtualMachines/Current/Contents/Home/bin/java -cp /Users/jacek/dev/oss/spark/conf/:/Users/jacek/dev/oss/spark/assembly/target/scala-2.11/spark-assembly-1.6.0-SNAPSHOT-hadoop2.7.1.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-core-3.2.10.jar:/Users/jacek/dev/oss/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar -Dscala.usejavacp=true -Xms1g -Xmx1g org.apache.spark.deploy.SparkSubmit --master spark://localhost:7077 --class org.apache.spark.repl.Main --name Spark shell spark-shell
========================================
```

== Show Spark version in Spark shell
## Show Spark version in Spark shell

In spark-shell, use `sc.version` or `org.apache.spark.SPARK_VERSION` to find out the Spark version:

```
```text
scala> sc.version
res0: String = 1.6.0-SNAPSHOT
scala> org.apache.spark.SPARK_VERSION
res1: String = 1.6.0-SNAPSHOT
```

== Resolving local host name
## Resolving local host name

When you face networking issues because Spark can't resolve your local hostname or IP address, set the preferred `SPARK_LOCAL_HOSTNAME` environment variable to a custom hostname, or `SPARK_LOCAL_IP` to a custom IP address that is later resolved to a hostname.

Spark checks them out before using http://docs.oracle.com/javase/8/docs/api/java/net/InetAddress.html#getLocalHost--[java.net.InetAddress.getLocalHost()] (consult https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L759[org.apache.spark.util.Utils.findLocalInetAddress()] method).
Spark checks them out before using [java.net.InetAddress.getLocalHost()](http://docs.oracle.com/javase/8/docs/api/java/net/InetAddress.html#getLocalHost--) (consult the [org.apache.spark.util.Utils.findLocalInetAddress()](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L759) method).
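
For example (a minimal sketch; the hostname and IP values are placeholders, and you set one or the other before starting a Spark application):

```text
SPARK_LOCAL_HOSTNAME=localhost ./bin/spark-shell
SPARK_LOCAL_IP=127.0.0.1 ./bin/spark-shell
```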

You may see the following WARN messages in the logs when Spark has finished the resolution process:

@@ -44,7 +39,7 @@ Set SPARK_LOCAL_IP if you need to bind to another address

## Starting standalone Master and workers on Windows 7

Windows 7 users can use [spark-class](tools/spark-class.md) to start Spark Standalone as there are no launch scripts for the Windows platform.
Windows 7 users can use [spark-class](../tools/spark-class.md) to start Spark Standalone as there are no launch scripts for the Windows platform.

```text
./bin/spark-class org.apache.spark.deploy.master.Master -h localhost
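
# a hedged sketch: a worker can be started the same way
# (the master URL below assumes the "-h localhost" master started above)
./bin/spark-class org.apache.spark.deploy.worker.Worker spark://localhost:7077
```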
132 changes: 132 additions & 0 deletions docs/tips-and-tricks/running-spark-windows.md
@@ -0,0 +1,132 @@
# Running Spark Applications on Windows

Running Spark applications on Windows is in general no different from running them on other operating systems like Linux or macOS.

!!! note
A Spark application could be [spark-shell](../tools/spark-shell.md) or your own custom Spark application.

What does make an important difference between the operating systems is Apache Hadoop, which Spark uses internally for file system access.

You may run into a few minor issues on Windows due to the way Hadoop works with Windows' POSIX-incompatible NTFS filesystem.

!!! note
You are not required to install Apache Hadoop to develop or run Spark applications.

!!! tip
Read the Apache Hadoop project's [Problems running Hadoop on Windows](https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems).

Among the issues is the infamous `java.io.IOException` while running Spark Shell (below is a stacktrace from Spark 2.0.2 on Windows 10, so the line numbers may be different in your case).

```text
16/12/26 21:34:11 ERROR Shell: Failed to locate the winutils binary in the hadoop binary path
java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
at org.apache.hadoop.util.Shell.getQualifiedBinPath(Shell.java:379)
at org.apache.hadoop.util.Shell.getWinUtilsPath(Shell.java:394)
at org.apache.hadoop.util.Shell.<clinit>(Shell.java:387)
at org.apache.hadoop.hive.conf.HiveConf$ConfVars.findHadoopBinary(HiveConf.java:2327)
at org.apache.hadoop.hive.conf.HiveConf$ConfVars.<clinit>(HiveConf.java:365)
at org.apache.hadoop.hive.conf.HiveConf.<clinit>(HiveConf.java:105)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:228)
at org.apache.spark.sql.SparkSession$.hiveClassesArePresent(SparkSession.scala:963)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:91)
```

!!! note
You need to have Administrator rights on your laptop.
All the following commands must be executed in a command-line window (`cmd`) run as Administrator, i.e. started using the **Run as administrator** option.

Download the `winutils.exe` binary from the [steveloughran/winutils](https://github.com/steveloughran/winutils) GitHub repository.

!!! note
Select the version of Hadoop the Spark distribution was compiled with, e.g. use `hadoop-2.7.1` for Spark 2 ([here is the direct link to `winutils.exe` binary](https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe)).

Save the `winutils.exe` binary to a directory of your choice (e.g., `c:\hadoop\bin`).

Set `HADOOP_HOME` to the directory above `bin` with `winutils.exe` (i.e., `c:\hadoop`, not `c:\hadoop\bin`).

```text
set HADOOP_HOME=c:\hadoop
```

Set the `PATH` environment variable to include `%HADOOP_HOME%\bin` as follows:

```text
set PATH=%HADOOP_HOME%\bin;%PATH%
```

!!! tip
Define the `HADOOP_HOME` and `PATH` environment variables in the Control Panel so that any Windows program can use them.
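
For example, `setx` persists a user-level environment variable so that newly opened `cmd` windows pick it up (a sketch; the current window is not affected, and `PATH` is usually safer to extend via the Control Panel dialog):

```text
setx HADOOP_HOME c:\hadoop
```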

Create the `C:\tmp\hive` directory.

!!! note
The `c:\tmp\hive` directory is the default value of the [`hive.exec.scratchdir` configuration property](https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.scratchdir) in Hive 0.14.0 and later (Spark uses a custom build of Hive 1.2.1).

You can change the `hive.exec.scratchdir` configuration property to point to another directory, as described in [Changing `hive.exec.scratchdir` Configuration Property](#changing-hive.exec.scratchdir) in this document.
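
For example, in the same elevated `cmd` session (a sketch):

```text
mkdir C:\tmp\hive
```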

Execute the following command in the `cmd` window that you started with the **Run as administrator** option.

```text
winutils.exe chmod -R 777 C:\tmp\hive
```

Check the permissions (this is one of the commands that are executed under the covers):

```text
winutils.exe ls -F C:\tmp\hive
```

Open `spark-shell` and observe the output (perhaps with a few WARN messages that you can simply disregard).

As a verification step, execute the following line to display the content of a `DataFrame`:

```text
scala> spark.range(1).withColumn("status", lit("All seems fine. Congratulations!")).show(false)
+---+--------------------------------+
|id |status |
+---+--------------------------------+
|0 |All seems fine. Congratulations!|
+---+--------------------------------+
```

!!! note
Disregard WARN messages when you start `spark-shell`. They are harmless.

```text
16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of
the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-core-3.2.10.jar" is already registered,
and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-core-
3.2.10.jar."
16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR
versions of the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/jars/datanucleus-api-jdo-3.2.6.jar" is already
registered, and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-
hadoop2.7/bin/../jars/datanucleus-api-jdo-3.2.6.jar."
16/12/26 22:05:41 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR
versions of the same plugin in the classpath. The URL "file:/C:/spark-2.0.2-bin-hadoop2.7/bin/../jars/datanucleus-rdbms-3.2.9.jar" is
already registered, and you are trying to register an identical plugin located at URL "file:/C:/spark-2.0.2-bin-
hadoop2.7/jars/datanucleus-rdbms-3.2.9.jar."
```

If you see the above output, you're done! You should now be able to run Spark applications on your Windows machine. Congrats! πŸ‘πŸ‘πŸ‘

## Changing hive.exec.scratchdir { #changing-hive.exec.scratchdir }

Create a `hive-site.xml` file with the following content:

```xml
<configuration>
<property>
<name>hive.exec.scratchdir</name>
<value>/tmp/mydir</value>
<description>Scratch space for Hive jobs</description>
</property>
</configuration>
```

Start a Spark application (e.g., `spark-shell`) with the `HADOOP_CONF_DIR` environment variable set to the directory that contains `hive-site.xml`.

```text
HADOOP_CONF_DIR=conf ./bin/spark-shell
```
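
On Windows `cmd`, the equivalent would be along these lines (a sketch; it assumes `hive-site.xml` sits in a local `conf` directory):

```text
set HADOOP_CONF_DIR=conf
bin\spark-shell
```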
@@ -1,8 +1,8 @@
== org.apache.spark.SparkException: Task not serializable
# org.apache.spark.SparkException: Task not serializable

When you run into the `org.apache.spark.SparkException: Task not serializable` exception, it means that you are using a reference to an instance of a non-serializable class inside a transformation. See the following example:

```
```text
➜ spark git:(master) βœ— ./bin/spark-shell
Welcome to
____ __
@@ -68,8 +68,8 @@ Serialization stack:
... 57 more
```
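
A minimal sketch of the pattern that triggers the exception and one common fix (the class and value names are illustrative):

```text
scala> class Multiplier { val factor = 2 }   // not Serializable
scala> val m = new Multiplier
scala> sc.parallelize(1 to 3).map(_ * m.factor).count
org.apache.spark.SparkException: Task not serializable
...

scala> // one fix: make the captured class Serializable
scala> class Multiplier2 extends Serializable { val factor = 2 }
scala> val m2 = new Multiplier2
scala> sc.parallelize(1 to 3).map(_ * m2.factor).count
res1: Long = 3
```

Another common workaround is to copy the needed value into a local `val` (e.g. `val f = m.factor`) and reference only that local inside the closure.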

=== Further reading
## Learn More

* https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html[Job aborted due to stage failure: Task not serializable]
* https://issues.apache.org/jira/browse/SPARK-5307[Add utility to help with NotSerializableException debugging]
* http://stackoverflow.com/q/22592811/1305344[Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects]
* [Job aborted due to stage failure: Task not serializable](https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html)
* [Add utility to help with NotSerializableException debugging](https://issues.apache.org/jira/browse/SPARK-5307)
* [Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects](http://stackoverflow.com/q/22592811/1305344)