[WIP] Scala 2.12 / Spark 3 upgrade #550

Open
wants to merge 100 commits into base: master
Conversation

@nicodv (Contributor) commented Mar 18, 2021

Related issues
#336
#332

Describe the proposed solution
Upgrade to Scala 2.12 and Spark 3

Describe alternatives you've considered
Living in the past, suffering from security issues and missing out on feature and speed improvements


tovbinm and others added 30 commits May 30, 2019 13:48
… made to decision tree pruning in Spark 2.4. If a node is split but both child nodes lead to the same prediction, the split is pruned away. This updates the test so this doesn't happen for feature 'b'.
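For context, here is a toy sketch of the pruning rule that commit message describes: if both children of a split lead to the same prediction, the split carries no information and can be collapsed into a single leaf. This is purely illustrative Scala, not Spark's actual tree implementation.

```scala
// Toy model of a decision tree; not Spark's internal representation.
sealed trait Node
final case class Leaf(prediction: Double) extends Node
final case class Split(feature: Int, threshold: Double, left: Node, right: Node) extends Node

object PruneSketch {
  // Collapse any split whose two (already pruned) children predict the same value.
  def prune(node: Node): Node = node match {
    case Split(f, t, l, r) =>
      (prune(l), prune(r)) match {
        case (Leaf(p1), Leaf(p2)) if p1 == p2 => Leaf(p1) // redundant split: prune it away
        case (pl, pr)                         => Split(f, t, pl, pr)
      }
    case leaf => leaf
  }
}
```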
@@ -192,7 +192,7 @@ class FeatureDistributionTest extends FlatSpec with PassengerSparkFixtureTest wi
     val fd2 = FeatureDistribution("A", None, 20, 20, Array(2, 8, 0, 0, 12), Array.empty)
     fd1.hashCode() shouldBe fd1.hashCode()
     fd1.hashCode() shouldBe fd1.copy(summaryInfo = fd1.summaryInfo).hashCode()
-    fd1.hashCode() should not be fd1.copy(summaryInfo = Array.empty).hashCode()
+    fd1.hashCode() shouldBe fd1.copy(summaryInfo = Array.empty).hashCode()

@tovbinm I just want to make sure this is correct. In principle, hashCode and equals should be consistent, and that's what I'm trying to accomplish here, but I figured you might have had a reason for wanting something different.
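A minimal sketch of the contract being referred to, using a made-up stand-in rather than the real FeatureDistribution: assuming equality is defined to ignore summaryInfo, hashCode has to ignore it as well, which is what the changed assertion above checks.

```scala
// Hypothetical, simplified stand-in for FeatureDistribution; only the
// equals/hashCode consistency rule is illustrated, not the real class.
final case class Dist(name: String, count: Long, summaryInfo: Array[Double]) {
  // Assumed semantics: equality ignores summaryInfo.
  override def equals(other: Any): Boolean = other match {
    case that: Dist => name == that.name && count == that.count
    case _          => false
  }
  // hashCode must mirror equals, so it leaves summaryInfo out too.
  override def hashCode(): Int = (name, count).hashCode()
}

object DistContractCheck extends App {
  val a = Dist("A", 20L, Array(2.0, 8.0, 0.0, 0.0, 12.0))
  val b = a.copy(summaryInfo = Array.empty)
  assert(a == b)                    // equal despite different summaryInfo
  assert(a.hashCode == b.hashCode)  // so the hash codes must match as well
}
```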

Collaborator:

I think this test was invalid.

@leahmcguire (Collaborator) left a comment:

What is the reason for deleting all the join functionality from data readers in this PR?

@tovbinm (Collaborator) commented Apr 23, 2021

@leahmcguire I think it is just not being used - #550 (comment)

@leahmcguire (Collaborator):

We should be careful about how we define "unused" in a public project. Also, that functionality would be needed to migrate projects that are on TransmogrifAI v0...

@emitc2h commented Apr 30, 2021

Hey @tovbinm, there's a unit test failure I've been investigating that's the result of a bug in Spark: https://issues.apache.org/jira/browse/SPARK-34805?focusedCommentId=17337491&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17337491.

I'm wondering why testData in SanityCheckerTest.scala (L137) is constructed the way it is, with the metadata for the features column added manually. The fact that the metadata isn't passed along in DataFrame.select anymore is discovered by this assertion.

I'm assuming Spark won't fix this any time soon, and I'm having trouble finding an alternative way of putting the metadata into the schema of testData. I've tried .withColumn, but it still relies on .select under the hood. What's your take on this?
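For what it's worth, here is a hedged sketch of one alternative: attaching the metadata by re-aliasing the column with Column.as(name, metadata) instead of editing the schema by hand. The column names and metadata values below are made up, and whether the metadata survives the Spark 3.1 .select behaviour tracked in SPARK-34805 would still need to be verified.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

object FeatureMetadataSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("metadata-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1.0, 0.0), (3.0, 1.0)).toDF("features", "label")

    // Hypothetical metadata payload, standing in for whatever the test actually needs.
    val featureMeta = new MetadataBuilder().putStringArray("vals", Array("a", "b")).build()

    // Attach the metadata by re-aliasing the column.
    val withMeta = df.select(col("features").as("features", featureMeta), col("label"))

    // Inspect whether the metadata is present in the resulting schema.
    println(withMeta.schema("features").metadata.json)
    spark.stop()
  }
}
```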

@nicodv (Contributor, Author) commented Apr 30, 2021

Also pinging @Jauntbox (we know you're out there!) for the question above.

@tovbinm (Collaborator) commented Apr 30, 2021

This is a known issue indeed. We have been copying over the metadata between fields each time we apply our transformers, e.g. OpTransformer1.transform.

@emitc2h commented Apr 30, 2021

> This is a known issue indeed. We have been copying over the metadata between fields each time we apply our transformers, e.g. OpTransformer1.transform.

I mean that there is a new problem with Spark 3.1. Even OpTransformer1.transform is broken now, since it relies on .select to pass the metadata back into the output DataFrame. SelectedModelCombinerTest tests the .transform function directly and also fails for the same reason.

@tovbinm (Collaborator) commented Apr 30, 2021

StructField still has the metadata in it; it's just that ExpressionEncoder in Spark 3.x no longer passes it through. Oh, it's a true bummer. We rely heavily on this feature.
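A rough sketch of the copying pattern mentioned above, not OpTransformer1.transform itself: since the input StructField still carries the metadata, it can be re-attached to the corresponding column of the output DataFrame. The helper name and column handling here are assumptions for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object MetadataCopySketch {
  // Re-attach the metadata that `input` carries on `colName` to the same column in `output`.
  // Purely illustrative; the real transformers manage metadata through their own schema handling.
  def copyColumnMetadata(input: DataFrame, output: DataFrame, colName: String): DataFrame = {
    val meta = input.schema(colName).metadata
    output.withColumn(colName, col(colName).as(colName, meta))
  }
}
```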

@hedibejaoui:

Hello, any estimate of when this PR will be ready? Thank you!

@nicodv (Contributor, Author) commented Sep 16, 2021

@hedibejaoui , we are running internal forks of TransmogrifAI and MLeap on Spark 3.1.1, so the bulk of the work has been done.

For public release, the MLeap dependency needs to be upgraded now that they're on Spark 3 too: combust/mleap#765

But since MLeap has upgraded to Spark 3.0.2 while TransmogrifAI is on 3.1.1, we have some testing left to do.

@hedibejaoui:

@nicodv Thanks for the information. Actually, we are using Spark 3.0.x because of some internal dependencies; any chance of a public release of TransmogrifAI for that version?

@Fatma-abdel:

Hello, when do you think this PR will be merged for public use? Thank you!

@EhsanSadr:

Hi,
This PR adds important functionality that I need for my project. When will it be merged?

Thank you

@MeriamAffes:

Hi, we are waiting for the additions in this PR. When will it be available? Thanks
