[#162] Swap to using ADD JAR instead of spark.jars.packages
This should hopefully let executors find the jars as well as the driver. I've
added some comments because this is a bit gnarly.
riley-harper committed Nov 22, 2024
1 parent 7864432 commit 7dcc81d
Showing 1 changed file with 17 additions and 5 deletions.
22 changes: 17 additions & 5 deletions hlink/spark/session.py
@@ -71,13 +71,13 @@ def spark_conf(self, executor_cores, executor_memory, driver_memory, cores):
         if os.path.isfile(jar_path):
             conf = conf.set("spark.jars", jar_path)

-        # If the SynapseML Python package is available, include the Scala
-        # package as well. Note that we have to pin to a particular version of
-        # the Scala package here.
+        # A bit of a kludge. We set spark.jars.repositories here in the configuration,
+        # but then we actually download the SynapseML Scala jar later in connect().
+        # See the comment on the ADD JAR SQL statement in connect() for some more
+        # context.
         #
-        # SynapseML used to be named MMLSpark.
+        # SynapseML used to be named MMLSpark, thus the URL.
         if _synapse_ml_available:
-            conf.set("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.8")
             conf.set("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")

         return conf
@@ -117,6 +117,18 @@ def connect(
         session.catalog.setCurrentDatabase(self.db_name)
         session.sparkContext.setCheckpointDir(str(self.tmp_dir))
         self._register_udfs(session)
+
+        # If the SynapseML Python package is available, include the Scala
+        # package as well. Note that we have to pin to a particular version of
+        # the Scala package here.
+        #
+        # Despite what the documentation for the spark.jars.packages config setting
+        # says, this is the only way that I have found to include this jar for both
+        # the driver and the executors. Setting spark.jars.packages caused errors
+        # because the executors could not find the jar.
+        if _synapse_ml_available:
+            session.sql("ADD JAR ivy://com.microsoft.azure:synapseml_2.12:1.0.8")
+
         return session

     def _register_udfs(self, session):
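
For readers who want to try the pattern outside of hlink, here is a minimal sketch of the ADD JAR step described in the comments above. It is not the code from this commit: `synapse_ml_available`, `add_jar_statement`, and `add_synapse_ml_jar` are hypothetical helper names, and the check for an importable `synapse` package is only an assumed stand-in for hlink's `_synapse_ml_available` flag. The Maven coordinate is the one pinned in the diff. The `session` argument is assumed to be an already-created Spark session (anything with a `sql()` method), configured with the `spark.jars.repositories` setting shown in `spark_conf()`.

```python
import importlib.util

# Maven coordinate pinned in the diff above.
SYNAPSEML_COORDINATE = "com.microsoft.azure:synapseml_2.12:1.0.8"


def synapse_ml_available() -> bool:
    """Hypothetical stand-in for hlink's _synapse_ml_available flag.

    Assumption: the SynapseML Python package is importable as `synapse`.
    """
    return importlib.util.find_spec("synapse") is not None


def add_jar_statement(coordinate: str) -> str:
    """Build an ADD JAR statement for a group:artifact:version coordinate."""
    if len(coordinate.split(":")) != 3:
        raise ValueError(f"expected group:artifact:version, got {coordinate!r}")
    return f"ADD JAR ivy://{coordinate}"


def add_synapse_ml_jar(session, available: bool) -> bool:
    """Run ADD JAR on an existing session.

    Returns True if the statement was issued and False if it was skipped
    because the Python package is not available.
    """
    if not available:
        return False
    session.sql(add_jar_statement(SYNAPSEML_COORDINATE))
    return True
```

With a real session this might be called as `add_synapse_ml_jar(session, synapse_ml_available())` right after the session is created. Passing the availability flag in explicitly keeps the helper easy to exercise with a stub session.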