[SPARK-54446][ML] FPGrowth supports local filesystem #53150
base: master
Using.resource(
  new ObjectInputStream(new BufferedInputStream(new FileInputStream(path)))
) { ois =>
  val schema = ois.readObject().asInstanceOf[StructType]
This uses Java deserialization, which seems unsafe (risk of remote code execution).
Related commit: #50922
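For illustration, a minimal sketch of one way to avoid `ObjectInputStream` for the schema: store it as a JSON string and read it back as plain text, so no arbitrary classes are instantiated on load. This is not the PR's code. The object and method names are hypothetical, the sketch assumes `scala.util.Using` from Scala 2.13, and it uses only the JDK so it runs without Spark. In Spark, the string read back could then be parsed with `DataType.fromJson`.

```scala
import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream}
import java.io.{FileInputStream, FileOutputStream}
import java.nio.file.Files
import scala.util.Using

// Hypothetical helper, not part of Spark: persists a schema as JSON text
// instead of a serialized Java object.
object SchemaJsonRoundTrip {

  // Write the JSON string with writeUTF. Note that writeUTF rejects strings
  // whose encoded form exceeds 65535 bytes, so a very large schema would
  // need a different framing (for example, a length-prefixed byte array).
  def writeSchemaJson(path: String, schemaJson: String): Unit =
    Using.resource(
      new DataOutputStream(new BufferedOutputStream(new FileOutputStream(path)))
    ) { dos =>
      dos.writeUTF(schemaJson)
    }

  // Read the string back. Only text is decoded here; no object graph is
  // deserialized, which is the point of avoiding ObjectInputStream.
  def readSchemaJson(path: String): String =
    Using.resource(
      new DataInputStream(new BufferedInputStream(new FileInputStream(path)))
    ) { dis =>
      dis.readUTF()
    }

  def main(args: Array[String]): Unit = {
    val path = Files.createTempFile("schema", ".bin").toString
    val json = """{"type":"struct","fields":[]}"""
    writeSchemaJson(path, json)
    assert(readSchemaJson(path) == json)
  }
}
```

The trade-off is that the reader must parse and validate the JSON itself, but a malformed file can then only produce a parse error rather than trigger code in a deserialized class.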
Switched to the Arrow format, as suggested by @cloud-fan.
WeichenXu123
left a comment
We need to address the RCE issue :)
b0d6e92 to
7a3db44
nit
cd9a177 to
8fe0d6e
val spark = df.sparkSession
val schema = df.schema
val maxRecordsPerBatch = spark.sessionState.conf.arrowMaxRecordsPerBatch
df.queryExecution.executedPlan.execute().mapPartitionsInternal { iter =>
Dataset already has def toArrowBatchRdd, shall we reuse it?
val schema: StructType = df.schema
dos.writeUTF(schema.json)

val iter = DatasetUtils.toArrowBatchRDD(df, "UTC").toLocalIterator
Does the Arrow library provide APIs to write to a local file?
holdenk
left a comment
Arrow isn't intended for long-term storage; it's intended as a wire protocol. I don't love using it for persisting models. I'm -0.9 on this change for now. Parquet seems like a better choice, most likely.
What changes were proposed in this pull request?
FPGrowth supports local filesystem
Why are the changes needed?
To make FPGrowth work with the local filesystem.
Does this PR introduce any user-facing change?
Yes, FPGrowth will work when local saving mode is on.
How was this patch tested?
updated tests
Was this patch authored or co-authored using generative AI tooling?
no