Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

What changes were proposed in this pull request?

FPGrowth supports local filesystem

Why are the changes needed?

to make FPGrowth work with local filesystem

Does this PR introduce any user-facing change?

yes, FPGrowth will work when local saving mode is one

How was this patch tested?

updated tests

Was this patch authored or co-authored using generative AI tooling?

no

Using.resource(
new ObjectInputStream(new BufferedInputStream(new FileInputStream(path)))
) { ois =>
val schema = ois.readObject().asInstanceOf[StructType]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses Java deserializer which seems unsafe (risk of Remote Code Execution)

Related commit: #50922

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resort to arrow format suggested by @cloud-fan

Copy link
Contributor

@WeichenXu123 WeichenXu123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to address the RCE issue :)

val spark = df.sparkSession
val schema = df.schema
val maxRecordsPerBatch = spark.sessionState.conf.arrowMaxRecordsPerBatch
df.queryExecution.executedPlan.execute().mapPartitionsInternal { iter =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dataset already has def toArrowBatchRdd, shall we reuse it?

val schema: StructType = df.schema
dos.writeUTF(schema.json)

val iter = DatasetUtils.toArrowBatchRDD(df, "UTC").toLocalIterator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the arrow library provide APIs to write to local file?

Copy link
Contributor

@holdenk holdenk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow isn't intended for long term storage it's intended as a wire protocol -- I don't love using it for persisting models. I'm -0.9 on this change for now. Parquet seems like a better choice most likely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants