[WIP] Scala 2.12 / Spark 3 upgrade #550

Open
wants to merge 100 commits into base: master
Conversation

@nicodv (Contributor) commented Mar 18, 2021

Related issues
#336
#332

Describe the proposed solution
Upgrade to Scala 2.12 and Spark 3

Describe alternatives you've considered
Living in the past, suffering from security issues and missing out on feature and speed improvements


tovbinm and others added 30 commits May 30, 2019 13:48
… made to decision tree pruning in Spark 2.4. If a node is split but both child nodes lead to the same prediction, the split is pruned away. This updates the test so this doesn't happen for feature 'b'.
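For context, here is a toy sketch of the pruning rule that commit message describes: if both children of a split lead to the same prediction, the split carries no information and can be collapsed into a single leaf. This is purely illustrative Scala, not Spark's actual tree implementation.

```scala
// Toy model of a decision tree; not Spark's internal representation.
sealed trait Node
final case class Leaf(prediction: Double) extends Node
final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

object PruneSketch {
  // Collapse any split whose two (already pruned) children predict the same value.
  def prune(node: Node): Node = node match {
    case Split(f, t, l, r) =>
      (prune(l), prune(r)) match {
        case (Leaf(p1), Leaf(p2)) if p1 == p2 => Leaf(p1) // redundant split: prune it away
        case (pl, pr)                         => Split(f, t, pl, pr)
      }
    case leaf => leaf
  }
}
```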
@@ -192,7 +192,7 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi
     val fd2 = FeatureDistribution("A", None, 20, 20, Array(2, 8, 0, 0, 12), Array.empty)
     fd1.hashCode() shouldBe fd1.hashCode()
     fd1.hashCode() shouldBe fd1.copy(summaryInfo = fd1.summaryInfo).hashCode()
-    fd1.hashCode() should not be fd1.copy(summaryInfo = Array.empty).hashCode()
+    fd1.hashCode() shouldBe fd1.copy(summaryInfo = Array.empty).hashCode()

@tovbinm I just want to make sure this is correct. In principle, hashCode and equals should be consistent, and that's what I'm trying to accomplish here, but I figured you might have had a reason for wanting something different.
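A minimal sketch of the contract being referred to, using a made-up stand-in rather than the real FeatureDistribution: assuming equality is defined to ignore summaryInfo, hashCode has to ignore it as well, which is what the changed assertion above checks.

```scala
// Hypothetical, simplified stand-in for FeatureDistribution; only the
// equals/hashCode consistency rule is illustrated, not the real class.
final case class Dist(name: String, count: Long, summaryInfo: Array[Double]) {
  // Assumed semantics: equality ignores summaryInfo.
  override def equals(other: Any): Boolean = other match {
    case that: Dist => name == that.name && count == that.count
    case _          => false
  }
  // hashCode must mirror equals, so it leaves summaryInfo out too.
  override def hashCode(): Int = (name, count).hashCode()
}

object DistContractCheck extends App {
  val a = Dist("A", 20L, Array(2.0, 8.0, 0.0, 0.0, 12.0))
  val b = a.copy(summaryInfo = Array.empty)
  assert(a == b)                    // equal despite different summaryInfo
  assert(a.hashCode == b.hashCode)  // so the hash codes must match as well
}
```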

Collaborator:

I think this test was invalid.

@leahmcguire (Collaborator) left a comment:

What is the reason for deleting all the join functionality from data readers in this PR?

@tovbinm (Collaborator) commented Apr 23, 2021

@leahmcguire I think it is just not being used - #550 (comment)

@leahmcguire (Collaborator):

We should be careful about how we define "unused" in a public project. Also, that functionality would be needed to migrate projects that are on TransmogrifAI v0...

@emitc2h commented Apr 30, 2021

Hey @tovbinm, there's a unit test failure I've been investigating that's the result of a bug in Spark: https://issues.apache.org/jira/browse/SPARK-34805?focusedCommentId=17337491&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17337491.

I'm wondering why testData in SanityCheckerTest.scala (L137) is constructed the way it is, with the metadata for the features column added manually. The fact that the metadata isn't passed along in DataFrame.select anymore is discovered by this assertion.

I'm assuming Spark won't fix this any time soon, and I'm having trouble finding an alternative way of putting the metadata into the schema of testData. I've tried .withColumn, but it still relies on .select under the hood. What's your take on this?
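For what it's worth, here is a hedged sketch of one alternative: attaching the metadata by re-aliasing the column with Column.as(name, metadata) instead of editing the schema by hand. The column names and metadata values below are made up, and whether the metadata survives the Spark 3.1 .select behaviour tracked in SPARK-34805 would still need to be verified.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

object FeatureMetadataSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("metadata-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1.0, 0.0), (3.0, 1.0)).toDF("features", "label")

    // Hypothetical metadata payload, standing in for whatever the test actually needs.
    val featureMeta = new MetadataBuilder().putStringArray("vals", Array("a", "b")).build()

    // Attach the metadata by re-aliasing the column.
    val withMeta = df.select(col("features").as("features", featureMeta), col("label"))

    // Inspect whether the metadata is present in the resulting schema.
    println(withMeta.schema("features").metadata.json)
    spark.stop()
  }
}
```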

@nicodv (Contributor, Author) commented Apr 30, 2021

Also pinging @Jauntbox (we know you're out there!) for the question above.

@tovbinm (Collaborator) commented Apr 30, 2021

This is a known issue indeed. We have been copying over the metadata between fields each time we apply our transformers, e.g. OpTransformer1.transform.

@emitc2h commented Apr 30, 2021

> This is a known issue indeed. We have been copying over the metadata between fields each time we apply our transformers, e.g. OpTransformer1.transform.

I mean that there is a new problem with Spark 3.1. Even OpTransformer1.transform is broken now, since it relies on .select to pass the metadata back into the output DataFrame. SelectedModelCombinerTest tests the .transform function directly and also fails for the same reason.

@tovbinm (Collaborator) commented Apr 30, 2021

StructField still has the metadata in it; it's just that ExpressionEncoder in Spark 3.x no longer passes it through. Oh, it's a true bummer. We rely heavily on this feature.
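A rough sketch of the copying pattern mentioned above, not OpTransformer1.transform itself: since the input StructField still carries the metadata, it can be re-attached to the corresponding column of the output DataFrame. The helper name and column handling here are assumptions for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object MetadataCopySketch {
  // Re-attach the metadata that `input` carries on `colName` to the same column in `output`.
  // Purely illustrative; the real transformers manage metadata through their own schema handling.
  def copyColumnMetadata(input: DataFrame, output: DataFrame, colName: String): DataFrame = {
    val meta = input.schema(colName).metadata
    output.withColumn(colName, col(colName).as(colName, meta))
  }
}
```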

@hedibejaoui:

Hello, any estimate of when this PR will be ready? Thank you!

@nicodv (Contributor, Author) commented Sep 16, 2021

@hedibejaoui , we are running internal forks of TransmogrifAI and MLeap on Spark 3.1.1, so the bulk of the work has been done.

For public release, the MLeap dependency needs to be upgraded now that they're on Spark 3 too: combust/mleap#765

But since MLeap has upgraded to Spark 3.0.2 while TransmogrifAI is on 3.1.1, we have some testing left to do.

@hedibejaoui:

@nicodv Thanks for the information. Actually, we are using Spark 3.0.x because of some internal dependencies; any chance of a public release of TransmogrifAI for that version?

@Fatma-abdel:

Hello, when do you think this PR will be merged for public use? Thank you!

@EhsanSadr:

Hi,
This PR adds important functionality that I need for my project. When will it be merged?

Thank you

@MeriamAffes:

Hi, we are waiting for the additions in this PR. When will it be available? Thanks
