
Support Scala 2.12 #332

Open · koertkuipers opened this issue Jun 14, 2019 · 39 comments

@koertkuipers

Spark 2.4 has both Scala 2.11 and 2.12 builds. Starting with Spark 3.0, Scala 2.12 will become the default and Scala 2.11 will be dropped.

This depends on #184

This will require:

  • Changes to the Gradle build to support both Scala 2.11 and 2.12. See Kafka for an example of how to do this.
  • Some (hopefully minor) code changes for compilation in Scala 2.12.
  • Updates/upgrades to dependencies. Some are not built for Scala 2.12 at all; others only have newer versions available for Scala 2.12.
@koertkuipers (Author)

Starting from #184, the first issue I ran into is gradle-scalastyle-plugin not being found for Scala 2.12. Resolved by adding this Maven repo:

        maven {
            name = "ngbinh"
            url = "https://dl.bintray.com/ngbinh/maven"
        }

@koertkuipers (Author)

To stop the Scala compiler from exiting with strange messages like assertion failed: ClassBType.info not yet assigned: Lorg/apache/spark/rdd/RDD; I removed the -optimize flag.

@tovbinm (Collaborator) commented Jun 14, 2019

Just a heads up: with #274 merged and compiling under Scala 2.12, the majority of the tests will fail, e.g. UnaryLambdaTransformerTest, due to stage serialization issues. We need to replace all lambda functions with concrete classes as explained in #274.

@tovbinm (Collaborator) commented Jun 14, 2019

Thanks for making the effort to upgrade and support both Scala 2.11/12 versions @koertkuipers !!

@koertkuipers (Author)

After publishing my own in-house versions of xgboost and mleap, it seems everything compiles and tests fine using Scala 2.12. Seemed a little too easy... I wonder if I am not running the tests correctly.

This is before #274. I get a bunch of merge conflicts with that one that I need to take a look at. Since I didn't make any significant changes, the merge conflicts are likely with the spark-2.4 branch.

@koertkuipers (Author)

xgboost not having the Scala version in its artifact name doesn't help,
e.g. xgboost4j-spark-0.90.jar vs xgboost4j-spark_2.11-0.90.jar.

@koertkuipers (Author)

See also dmlc/xgboost#4574.

@koertkuipers (Author)

I built xgboost 0.90 for Scala 2.11/2.12 here and published it to our bintray.

I built mleap for Scala 2.11/2.12 here and published it to our bintray as well.

My branch of TransmogrifAI for Scala 2.12 is feat-scala212. It's currently set to build for only Scala 2.12, since I don't yet know how to do both with Gradle in one pass, but it also builds fine for Scala 2.11 if you simply change scalaVersion and scalaVersionRevision.

@tovbinm (Collaborator) commented Jun 21, 2019

Nice! Please pull the latest from https://github.com/salesforce/TransmogrifAI/tree/mt/spark-2.4 since I just updated it.

I would like to try adding a cross-build for 2.11/2.12.

@koertkuipers (Author)

I expected to see serialization issues after merging in https://github.com/salesforce/TransmogrifAI/tree/mt/spark-2.4 due to #274, but I don't 😕

I see you put a lot of functions in companion objects, like:

object BinaryTransformerTest {
  def fn: (Real, RealNN) => Real = (i1, i2) => new Real(for {v1 <- i1.value; v2 <- i2.value} yield v1 / (v2 * v2))
}

Do you expect this approach to work for Scala 2.12?

@tovbinm (Collaborator) commented Jun 21, 2019

I think it works only because the models are serialized and loaded within the same code. If you train a model, save it, then recompile the codebase and try to load the model, it would fail.

@wsuchy (Contributor) commented Jun 21, 2019

@koertkuipers I agree here with @tovbinm, the problem is that in Scala 2.12 all lambda functions are given random names each time you compile your code. So a model trained on your laptop will not load in your production environment due to missing references. We are currently working on a fix for that.
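
A minimal sketch of the naming problem (illustrative names, not project code): Scala 2.11 compiles anonymous functions to stable anonfun classes, while 2.12 emits them at runtime via invokedynamic/LambdaMetafactory:

object Lambdas {
  val fn: Int => Int = _ + 1
}

// On 2.11 this prints a stable, compiled-in class name such as
// Lambdas$$anonfun$1, which Class.forName can resolve in a later run.
// On 2.12 it prints an ephemeral, JVM-generated name such as
// Lambdas$$$Lambda$1/1132504838, which has no class file behind it and
// cannot be looked up with Class.forName after a recompile.
println(Lambdas.fn.getClass.getName)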

@koertkuipers (Author)

@tovbinm @wsuchy Is there any way you can think of to test this? Would forking for tests do it?

@tovbinm (Collaborator) commented Jun 23, 2019

The test should be constructed as follows:

  1. Train several models (binary classification, multiclass and regression) with all feature types (example).
  2. Save these models as part of core/test/resources like this one.
  3. Add a new test that would load these models and score them (see the sketch below).

This way we would have a set of models that contain stages saved from previous JVM runs, which would allow us to test whether the models load & produce scores correctly.
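
A rough sketch of what step 3 could look like, assuming the workflow.loadModel / model.score() API; the resource path, the passengerData input, and the single model are placeholders (a real test would loop over the binclass/multiclass/regression models from step 2):

import org.apache.spark.sql.SparkSession
import com.salesforce.op.OpWorkflow

implicit val spark: SparkSession = SparkSession.builder().master("local[*]").getOrCreate()

// `workflow` is assumed to be the OpWorkflow from step 1 that defined the features.
val modelPath = getClass.getResource("/models/binclass").getPath
val model = workflow.loadModel(path = modelPath)

// Scoring forces deserialization of every stage, including any
// lambda-backed transformers written out by a previous JVM run.
val scores = model.setInputDataset(passengerData).score()
scores.show()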

@tovbinm (Collaborator) commented Jun 23, 2019

I propose to follow the 2.12 migration guide in #274 and replace all the introduced lambda functions with concrete classes.

@koertkuipers (Author)

> After publishing my own in-house versions of xgboost and mleap, it seems everything compiles and tests fine using Scala 2.12. Seemed a little too easy... I wonder if I am not running the tests correctly.

I spoke too soon about everything testing fine: I was using compileTestScala, but now I tried test, which runs more tests, and some fail. So I have some more work to do.

@koertkuipers (Author)

All the tests except one pass before #274. The only failure before #274 is:

  [-] RecordInsightsLOCO should return the most predictive features for data generated with a strong relation to the label (1.33 secs)

com.salesforce.op.stages.impl.insights.RecordInsightsLOCOTest > RecordInsightsLOCO should return the most predictive features for data generated with a strong relation to the label FAILED
    org.scalatest.exceptions.TestFailedException: 0.953500197613813 was not less than 0.8 The ratio of feature strengths between important and other features should be similar to the ratio of feature importances from Spark's RandomForest

After #274 I have lots of failures, but this is good news, as this was expected and we know how to fix them.
The failures look like this:

    java.lang.RuntimeException: Failed to write out stage 'uid_1234'

        Caused by:
        java.lang.RuntimeException: Argument 'transformFn' [com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838] cannot be serialized. Make sure com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838 has either no-args ctor or is an object, and does not have any external dependencies, e.g. use any out of scope variables.

            Caused by:
            java.lang.RuntimeException: Failed to create an instance of class 'com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838'. Class has to either have a no-args ctor or be an object.

                Caused by:
                java.lang.ClassNotFoundException: com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838

@koertkuipers (Author) commented Jul 8, 2019

I tried to convert some lambdas to concrete classes as explained in the #274 migration notes.

For example, in PassengerFeaturesTest.scala, this:

def genderFn: Passenger => MultiPickList = p => Set(p.getGender).toMultiPickList

I replaced with:

  class GenderFn extends Function1[Passenger, MultiPickList] {
    def apply(p: Passenger): MultiPickList = Set(p.getGender).toMultiPickList
  }

This did not work; I now get:

        java.io.NotSerializableException: com.salesforce.op.test.PassengerFeaturesTestLambdas$GenderFn

However, it works if I add Serializable:

  class GenderFn extends Function1[Passenger, MultiPickList] with Serializable {
    def apply(p: Passenger): MultiPickList = Set(p.getGender).toMultiPickList
  }
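
For completeness, a hedged sketch of how such a concrete class might be wired in at the call site (mirroring the FeatureBuilder.extract pattern that appears later in this thread; the exact builder call is an assumption, not code from this branch):

  // GenderFn is a named class with a no-args ctor and a stable class name,
  // so the stage serializer can record it and reinstantiate it on load.
  val gender = FeatureBuilder.MultiPickList[Passenger].extract(new GenderFn).asPredictor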

@koertkuipers (Author)

Following up on the above: I always thought Scala functions extend Serializable, but it's a little more complex than that. An anonymous function value is serializable out of the box, while a hand-written class extending Function1 is not:

scala> class X extends Function1[String, String]{ def apply(s: String) = s + "!" }
defined class X

scala> val x = new X
x: X = <function1>

scala> x.asInstanceOf[Serializable]
java.lang.ClassCastException: X cannot be cast to scala.Serializable
  ... 32 elided

scala> val y = { (s: String) => s + "!" }
y: String => String = <function1>

scala> y.asInstanceOf[Serializable]
res6: Serializable = <function1>

@koertkuipers (Author)

> @koertkuipers I agree here with @tovbinm, the problem is that in Scala 2.12 all lambda functions are given random names each time you compile your code. So a model trained on your laptop will not load in your production environment due to missing references. We are currently working on a fix for that.

@wsuchy @tovbinm Is what I am seeing the same issue, or is this something else?

I am seeing:

    java.lang.RuntimeException: Failed to write out stage 'uid_1234'

        Caused by:
        java.lang.RuntimeException: Argument 'transformFn' [com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838] cannot be serialized. Make sure com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838 has either no-args ctor or is an object, and does not have any external dependencies, e.g. use any out of scope variables.

            Caused by:
            java.lang.RuntimeException: Failed to create an instance of class 'com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838'. Class has to either have a no-args ctor or be an object.

                Caused by:
                java.lang.ClassNotFoundException: com.salesforce.op.stages.Lambdas$$$Lambda$5968/1132504838

@wsuchy (Contributor) commented Jul 9, 2019

@koertkuipers The way I see this working is to explicitly wrap lambdas inside a specific object (one lambda per object), e.g.:

// Each lambda lives in its own named object, whose class name is stable.
trait TMOGLambda[A, B] extends Serializable {
  val f: A => B
}

object Lambda1 extends TMOGLambda[Int, String] {
  override val f = (x: Int) => x.toString
}

And then have this supported by the serializer: check for the presence of such an object and hardcode it to always point to f, or whatever we name the property holding our lambda.
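
A hedged sketch of the deserializer side under this proposal (the helper and the lookup are assumptions, not an existing API). A Scala object compiles to a class whose name ends in "$" and that exposes a static MODULE$ field, both of which are stable across recompilations:

// Hypothetical: resolve the singleton object from its saved name
// (e.g. "com.example.Lambda1") and read the lambda from its `f` member.
def loadLambda(className: String): TMOGLambda[_, _] =
  Class.forName(className + "$")
    .getField("MODULE$")
    .get(null)
    .asInstanceOf[TMOGLambda[_, _]]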

@tovbinm (Collaborator) commented Jul 9, 2019

I think it's cleaner to have classes extend the FunctionX base classes, i.e.

class GenderFn extends Function1[Passenger, MultiPickList] with Serializable {
    def apply(p: Passenger): MultiPickList = Set(p.getGender).toMultiPickList
}

@wsuchy (Contributor) commented Jul 9, 2019

@tovbinm The problem will be supporting it from macros, as at the time we evaluate them there is no type information.

@koertkuipers (Author)

@tovbinm @wsuchy Take a look at tresata-opensource@95095ed. I simply replaced lambdas with concrete classes until all the tests passed. It's not pretty, but it works.

I am unsure whether I have to add more tests for workflow-independent model loading, e.g. loading a model from JSON files.

@wsuchy (Contributor) commented Jul 12, 2019

Found this out today:

In our hello world example, code like this:

val sepalLength = FeatureBuilder.Real[Iris].extract(_.sepalLength.toReal).asPredictor

results in JSON containing:

 "extractFn" : {
        "className" : "com.salesforce.hw.OpIrisSimple$$anonfun$5",
        "value" : {
          "className" : "com.salesforce.hw.OpIrisSimple$$anonfun$5"
        }
      },

Even though one can instantiate the workflow back, the names are still volatile and depend on the order of occurrence. I think we should throw an exception instead.
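
A minimal sketch of such a guard, assuming the serializer can inspect the function's runtime class before writing the stage out (the helper name and the matched patterns are assumptions):

// Hypothetical check: reject compiler-generated anonymous function names,
// which vary with compilation order, before serializing a stage.
def assertStableFnName(fn: AnyRef): Unit = {
  val name = fn.getClass.getName
  require(
    !name.contains("$anonfun$") && !name.contains("$$Lambda$"),
    s"Refusing to serialize '$name': anonymous function class names are not " +
      "stable across compilations; use a concrete class or object instead."
  )
}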

@koertkuipers (Author)

OK, we now have all tests passing in Scala 2.12 and are fully caught up with the master branch; see:
https://github.com/tresata-opensource/TransmogrifAI/tree/feat-scala212

@tovbinm (Collaborator) commented Aug 8, 2019

That’s outstanding!! It would be great to have it merged into our codebase, together with the cross-build changes for Scala 2.11/2.12.

@tovbinm (Collaborator) commented Aug 24, 2019

@koertkuipers are there any plans on your end to raise a PR with Scala 2.12 changes? ;)

@koertkuipers (Author) commented Aug 25, 2019 via email

@tovbinm (Collaborator) commented Aug 25, 2019

I can help to add it. Let’s start from what you already have.

@Helw150 commented Aug 30, 2019

edit: Ah, I just saw that you are reverting back to Spark 2.3, so I'll assume this is blocked until the overall repo is ready for 2.4!

@tovbinm I'm not particularly well versed in Gradle, but a cross-compiled TransmogrifAI is something I'm interested in having, so I'm happy to spend some time this weekend taking a stab at it if you don't have free cycles.

Have you had a chance to try @koertkuipers' branch, or should I start?

@tovbinm (Collaborator) commented Sep 3, 2019

We are planning to keep Spark 2.3 with Scala 2.11 in our master branch. Additionally, we will maintain the Spark 2.4 with Scala 2.11/2.12 branch separately, until the team is ready to move to Spark 2.4.

I haven't had a chance to review @koertkuipers' branch. Where is it?

@koertkuipers (Author)

> We are planning to keep Spark 2.3 with Scala 2.11 in our master branch. Additionally, we will maintain the Spark 2.4 with Scala 2.11/2.12 branch separately, until the team is ready to move to Spark 2.4.
>
> I haven't had a chance to review @koertkuipers' branch. Where is it?

It's here:
https://github.com/tresata-opensource/TransmogrifAI/tree/feat-scala212
The changes are minimal.

@tovbinm (Collaborator) commented Sep 4, 2019

@koertkuipers It seems that you had to recompile xgboost with Scala 2.12. Would they not accept the PR as well, so that we could use an official release?

@koertkuipers (Author)

See dmlc/xgboost#4574. It has been merged, but there is no release yet.

@wsuchy (Contributor) commented May 1, 2020

@tovbinm @koertkuipers It looks like we're finally unblocked: https://search.maven.org/artifact/ml.dmlc/xgboost-jvm_2.12

@tovbinm (Collaborator) commented May 4, 2020

Sweet! For a start, we need the cross-version build, since I assume you folks are not ready for 2.12 yet.

@nicodv (Contributor) commented Sep 11, 2020

Baby steps: I've done some work on cross-version building in a branch based on @koertkuipers' fork: https://github.com/salesforce/TransmogrifAI/tree/ndv/scala212

I'm using https://github.com/ADTRAN/gradle-scala-multiversion-plugin, which seems to work well.

EDIT: this has been resolved ----> Looks like we're still blocked by MLeap (and the XGBoost component of that) for 2.12: combust/mleap#697

@nicodv (Contributor) commented Sep 12, 2020

An observation: the only version of XGBoost built for Scala 2.12 (1.2.0) also requires Spark 3 (https://github.com/dmlc/xgboost/releases). So if we want to keep using XGBoost, there may be no way to separate the Scala 2.12 upgrade from the Spark 3 upgrade.
