
[SPARK-48008][WIP] Support UDAFs in Spark Connect #46245

Open · wants to merge 12 commits into base: master
Conversation

@xupefei (Contributor) commented Apr 26, 2024

What changes were proposed in this pull request?

This PR changes Spark Connect to support defining and registering Aggregator[IN, BUF, OUT] UDAFs.
The mechanism is similar to that used for Scalar UDFs: on the client side, we serialize the Aggregator instance and send it to the server, where the data is deserialized into an Aggregator instance recognized by Spark Core.
With this PR we now have two Aggregator interfaces defined, one in the Connect API and one in Core. They declare exactly the same abstract methods and share the same SerialVersionUID, so the Java serialization engine can map one to the other. It is therefore critical to keep these two definitions in sync at all times.
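To illustrate the shared-UID mechanism, here is a minimal, Spark-free sketch (the class names ConnectSideAgg and CoreSideAgg are hypothetical stand-ins, not the actual classes in this PR). Pinning the same SerialVersionUID on two structurally identical classes makes the JVM treat their serialized forms as compatible; without the annotation, the JVM would compute different default UIDs because the derivation includes the class name:

```scala
import java.io.ObjectStreamClass

// Hypothetical stand-in for the Connect-side Aggregator definition.
@SerialVersionUID(2093413866369130093L)
abstract class ConnectSideAgg[-IN, BUF, OUT] extends Serializable {
  def zero: BUF
  def reduce(b: BUF, a: IN): BUF
  def merge(b1: BUF, b2: BUF): BUF
  def finish(reduction: BUF): OUT
}

// Hypothetical stand-in for the Core-side Aggregator definition:
// same abstract methods, same pinned UID.
@SerialVersionUID(2093413866369130093L)
abstract class CoreSideAgg[-IN, BUF, OUT] extends Serializable {
  def zero: BUF
  def reduce(b: BUF, a: IN): BUF
  def merge(b1: BUF, b2: BUF): BUF
  def finish(reduction: BUF): OUT
}

val connectUid =
  ObjectStreamClass.lookup(classOf[ConnectSideAgg[Int, Int, Int]]).getSerialVersionUID
val coreUid =
  ObjectStreamClass.lookup(classOf[CoreSideAgg[Int, Int, Int]]).getSerialVersionUID

// Both UIDs are the pinned value, so the serialization engine
// considers the two class descriptors compatible.
println(connectUid == coreUid) // true
```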

A follow-up to this PR is to add the Aggregator.toColumn API (currently NotImplemented due to a dependency on Spark Core).

Why are the changes needed?

Spark Connect does not have UDAF support. We need to fix that.

Does this PR introduce any user-facing change?

Yes. Connect users can now define an Aggregator and register it:

val agg = new Aggregator[Int, Int, Int] { ... }
spark.udf.register("agg", udaf(agg))
val ds: Dataset[Data] = ...
val aggregated = ds.selectExpr("agg(i)")
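For context, a fleshed-out version of the snippet above might look like the following sketch. This is an assumed example, not code from the PR: the sum semantics, the Data case class, the column name i, and the connect endpoint sc://localhost are all illustrative.

```scala
import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

case class Data(i: Int)

// A minimal sum Aggregator matching the [Int, Int, Int] shape above.
val agg = new Aggregator[Int, Int, Int] {
  def zero: Int = 0
  def reduce(b: Int, a: Int): Int = b + a
  def merge(b1: Int, b2: Int): Int = b1 + b2
  def finish(reduction: Int): Int = reduction
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

// Assumes a Spark Connect session; the endpoint is illustrative.
val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()
import spark.implicits._

spark.udf.register("agg", udaf(agg))
val ds: Dataset[Data] = Seq(Data(1), Data(2)).toDS()
ds.selectExpr("agg(i)").show() // a single row with the sum of column i
```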

How was this patch tested?

Added new tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@xupefei xupefei marked this pull request as ready for review April 30, 2024 12:42
@@ -49,6 +49,7 @@ import org.apache.spark.sql.execution.aggregate.TypedAggregateExpression
* @tparam OUT The type of the final output result.
* @since 1.6.0
*/
@SerialVersionUID(2093413866369130093L)
Contributor:
Why is this needed?

Contributor:
TypedColumn?

@xupefei (Author) commented May 10, 2024:
It's for mapping the client's Aggregator class to this one. This is required because we now serialise the whole Aggregator instance on the client side.

@xupefei (Author) commented May 10, 2024:
I tried multiple approaches and none succeeded without this UID. As long as we want users to define a UDAF like this:

new Aggregator {
   def merge(...) = {...}
   ...
}

The serialized payload (either the whole Aggregator instance or individual methods, e.g. agg.merge _) will carry a reference to the Aggregator class that needs to be decoded on the server side. Without this UID, decoding fails with an error saying the class UID does not match.
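The failure mode described here is standard Java serialization behaviour: the stream for an anonymous subclass records the synthetic class name plus the UID of every serializable superclass, and readObject throws InvalidClassException when a UID does not match. A minimal, Spark-free sketch of the round trip (MiniAgg is an illustrative stand-in, not a Spark class):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

@SerialVersionUID(1L)
abstract class MiniAgg extends Serializable {
  def merge(b1: Int, b2: Int): Int
}

// Written the way a user writes `new Aggregator { ... }`:
// an anonymous subclass with a compiler-generated (synthetic) name.
val agg: MiniAgg = new MiniAgg {
  def merge(b1: Int, b2: Int): Int = b1 + b2
}

// Serialize: the payload references the concrete synthetic class
// (something like ...$$anon$1) and the UID of MiniAgg.
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(agg)
oos.close()

// Deserialization succeeds here only because the receiving side resolves
// the same classes with matching UIDs; a mismatched UID on MiniAgg would
// raise InvalidClassException instead.
val back = new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray))
  .readObject().asInstanceOf[MiniAgg]
println(back.merge(2, 3)) // 5
```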

* @since 4.0.0
*/
@SerialVersionUID(2093413866369130093L)
abstract class Aggregator[-IN, BUF, OUT] extends Serializable {
Contributor:
Can we move this to a common module instead of having two abstract classes?

@xupefei (Author):
Yes, that would be ideal. I was doing that until I found that the Connect client needs its own docstring (@since 4.0.0). Could you suggest how we could document this on the client side if the class is moved to a common module?

* Returns this `Aggregator` as a `TypedColumn` that can be used in `Dataset`
* operations.
*/
def toColumn: TypedColumn[IN, OUT] = {
Contributor:
How should this work on the connect side?

Contributor:
I think it should take a wildcard; we have done similar things before.

@xupefei xupefei requested a review from hvanhovell May 15, 2024 14:12