
Spark Compatibility Notes

Whenever these examples are updated for a new Spark version, some changes tend to be needed, and a few of them are interesting and important. The details are recorded here, starting with Spark 1.5.0.

Spark 2.0.0

Build failures

  1. dataframe/UDT.scala
  2. graphx/SecondDegreeNeighbors.scala
  3. hiveql/LateralViewExplode.scala
  4. hiveql/SimpleUDAF.scala
  5. hiveql/SimpleUDF.scala
  6. sql/CustomRelationProvider.scala
  7. sql/ExternalNonRectangular.scala
  8. sql/JSON.scala
  9. sql/JSONSchemaInference.scala
  10. sql/JSONTypes.scala
  11. sql/MixedJSONQuery.scala
  12. sql/OutputJSON.scala
  13. sql/RelationProviderFilterPushdown.scala
  14. sql/SchemaConversion.scala
  15. sql/UDF.scala
  16. sql/UDT.scala
  17. streaming/CustomReceiver.scala

Problems Resolved

  1. DataFrame.foreach needs more careful passing of println because of overloading (see the sketches after this list)
  2. User defined types (UDTs) have been removed
  3. Dataset.partitionBy is not supported -- repartition seems to be the preferred alternative
  4. graphx.mapReduceTriplets was deprecated in 1.2 and is now gone -- replaced by aggregateMessages (see the sketches after this list)
  5. org.apache.spark.Logging is no longer available, but it wasn't being used anyway
  6. A Dataset is not an RDD
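
To illustrate items 1 and 4, here are two minimal sketches. They are not taken from the repository; the object names and toy data are invented, and a local master is assumed. The first shows the foreach/println disambiguation that Spark 2.0 forces:

```scala
import org.apache.spark.sql.SparkSession

object ForeachPrintlnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("ForeachPrintlnSketch").getOrCreate()
    import spark.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
    // In Spark 2.0, df.foreach(println) no longer compiles cleanly because
    // Dataset.foreach is overloaded (a Scala T => Unit and a Java ForeachFunction[T]);
    // an explicit lambda resolves the ambiguity.
    df.foreach(r => println(r))
    spark.stop()
  }
}
```

The second shows aggregateMessages computing in-degrees, the kind of job mapReduceTriplets used to do:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object AggregateMessagesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("AggregateMessagesSketch"))
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))
    val graph = Graph(vertices, edges)
    // aggregateMessages replaces the removed mapReduceTriplets: send 1 to each
    // edge's destination vertex and sum the messages to get in-degrees.
    val inDegrees = graph.aggregateMessages[Int](ctx => ctx.sendToDst(1), _ + _)
    inDegrees.collect().foreach(println)
    sc.stop()
  }
}
```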

Deprecations

  1. dataframe.registerTempTable
  2. SQLContext
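
Both of these have straightforward 2.0-style replacements; here is a minimal sketch (object name and toy data invented, local master assumed) using SparkSession in place of SQLContext and createOrReplaceTempView in place of registerTempTable:

```scala
import org.apache.spark.sql.SparkSession

object TempViewSketch {
  def main(args: Array[String]): Unit = {
    // SparkSession is the 2.0 entry point that subsumes SQLContext.
    val spark = SparkSession.builder().master("local").appName("TempViewSketch").getOrCreate()
    import spark.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")
    // createOrReplaceTempView is the non-deprecated replacement for registerTempTable.
    df.createOrReplaceTempView("items")
    spark.sql("SELECT id FROM items WHERE label = 'b'").show()
    spark.stop()
  }
}
```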

Spark 1.6.0

  1. If you are running the examples through sbt, you will now get a non-default JVM MaxPermSize setting so that the hiveql examples have enough memory to run (see the sketch after this list).
  2. The two versions of UDT.scala (one in dataframe and one in sql) used to depend on the fact that ArrayData and GenericArrayData were public in org.apache.spark.sql.types. This is no longer true in Spark 1.6.0, due to SPARK-11273. More recently, as a result of SPARK-11780, deprecated type aliases have been added back, and this change is slated for Spark 1.6.1. Frankly, I find this change a bit disturbing, since every attempt at defining a user-defined type anywhere in the Spark source tree requires these two types for serialization and deserialization -- see, for example, ExamplePointUDT.scala and UserDefinedTypeSuite.scala. I can understand that perhaps functionality needed for UDTs shouldn't pollute the general-purpose public APIs, but then I would argue that support for them needs a package of its own.
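
For item 1, a build.sbt fragment along these lines is one way to pass the setting; this is an illustrative sketch rather than the repository's actual configuration, and it only matters on JVMs old enough to have a permanent generation:

```scala
// Fork example runs so that javaOptions take effect, and raise MaxPermSize
// so the hiveql examples have enough permanent-generation space.
fork := true
javaOptions += "-XX:MaxPermSize=128M"
```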

Spark 1.5.0

  1. The Hive examples (hive.*) were failing with memory problems. To execute them, one has to supply the flag -XX:MaxPermSize=128M to the JVM somehow. This setting works for these examples, but whether it is the "right" setting in practice depends on your application.
  2. sql.OutputJSON needed to be extended because JSON integers that were previously interpreted as ints are now interpreted as longs.
  3. sql.Types had to change quite a lot because the type conversions seem to have become a lot more stringent.
  4. In dealing with a deprecation in sql.JSONTypes, I can't find a supported way to provide a schema when reading a JSON file. This is not strictly speaking a 1.5.0 problem. I'll keep looking for a solution.
  5. The last example (passing an array to a UDF) in dataframe.UDF needed to be changed because passing an array now results in the UDF receiving a WrappedArray rather than an ArrayBuffer (see the sketch after this list).
  6. With the introduction of more systematic reading and writing for dataframes, I took this opportunity to replace all uses of the older, deprecated techniques.
  7. Again not really a 1.5.0 problem, but there are two more deprecations I couldn't find a good way to deal with:
    1. In hiveql.UDAF the deprecated approach is the only one I've been able to figure out so far.
    2. In queue-based streaming, scala.collection.mutable.SynchronizedQueue has been deprecated for some time, but there doesn't seem to be a non-deprecated replacement that StreamingContext.queueStream() will accept as an input.
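
For item 5, one way to stay insulated from the change is to declare the UDF parameter as Seq, which accepts whichever concrete collection Spark hands it. A minimal sketch (object name and toy data invented, local master assumed, written against the 1.5-era SQLContext API):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

object ArrayUDFSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("ArrayUDFSketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    val df = sc.parallelize(Seq((1, Seq(1, 2, 3)), (2, Seq(4, 5)))).toDF("id", "values")
    // Typing the parameter as Seq[Int] rather than ArrayBuffer or WrappedArray
    // keeps the UDF working regardless of which collection Spark passes in.
    val total = udf((xs: Seq[Int]) => xs.sum)
    df.select(df("id"), total(df("values")).as("total")).show()
    sc.stop()
  }
}
```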