[SPARK-48233][SS][TESTS] Tests for streaming on columns with non-default collations #46247
Conversation
val inputData = MemoryStream[(String)]
val result = inputData.toDF()
  .select(col("value")
    .try_cast(StringType("UTF8_BINARY_LCASE")).as("str"))
I'm a bit confused by this - is the test name flipped or is UTF8_BINARY_LCASE considered a non-binary collation?
AFAIK, LCASE means comparing as lowercase, so yes it's bound to non-binary equality.
That's right, UTF8_BINARY_LCASE is a non-binary collation.
In other words, "AAA" and "aaa" are considered equal, even though binary representations are clearly different.
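A tiny plain-Python model (not Spark code; `lcase_equal` is an illustrative name, and `str.lower()` only approximates Spark's Unicode-aware lowercasing) of what non-binary equality means here: equality is decided on the lowercased forms, so two strings with different binary representations can compare equal.

```python
def lcase_equal(a: str, b: str) -> bool:
    # Rough stand-in for UTF8_BINARY_LCASE equality; Spark's actual
    # implementation performs proper Unicode lowercasing, so this is
    # only an approximation for illustration.
    return a.lower() == b.lower()

# Equal under the collation, even though the binary forms differ.
print(lcase_equal("AAA", "aaa"))                       # True
print("AAA".encode("utf-8") == "aaa".encode("utf-8"))  # False
```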
testStream(filteredDf)(
  StartStream(triggerClock = clock, trigger = Trigger.ProcessingTime(100)),
  Execute { _ =>
    spark.createDataFrame(Seq("aaa" -> 1, "AAA" -> 2, "bbb" -> 3, "aa" -> 4))
I think it's important to also test a scenario where the incoming stream has a non-default collation itself.
Not sure I understand. streamDf has non-default collation, UTF8_BINARY_LCASE. What do you mean by "incoming" stream?
+1 to what @HeartSaVioR said. The idea was exactly to use UTF8_BINARY_LCASE in the source.
Makes sense, I think I misread the test case the first time.
Let's make the scope of the tests we are adding here clear. I see the PR title says "stateless", but you are also aware that deduplication is "stateful". While I agree that we probably don't want to add collation tests for every stateful operator, let's make the scope clearer in the PR title.
val inputData = MemoryStream[(String, Int)]
val result = inputData.toDF()
  .select(col("_1")
    .try_cast(StringType("UNICODE")).as("str"),
While we are here: I see UNICODE has binary equality but non-binary ordering. Does this still ensure that we can put this into RocksDB, whose keys are binary-sorted, and find the key group based on a prefix of the key that includes this column?
E.g., say we have two columns in the grouping key, dept (String with UNICODE collation) and session start (timestamp), and want to scan all grouping keys whose dept is 'dept1'. This is required for several operations, like session window aggregation.
My gut feeling is yes, but I would like to double-confirm.
In theory it should work, but we need to test this as well.
Binary ordering means you can use the binary representation to check which string comes first alphabetically. But if we only care about equality (which is what groupings and joins usually use), only binary equality matters.
I will follow up with additional testing.
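A rough, Spark-free sketch of why this should hold (all names here are illustrative, not Spark or RocksDB APIs): if each string has one canonical binary encoding (binary equality), then in a binary-sorted store every entry sharing a grouping-key prefix is contiguous, so a prefix scan finds them all, regardless of how the collation orders distinct strings.

```python
import bisect

def encode_key(dept: str, session_start: int) -> bytes:
    # Hypothetical key encoding: dept bytes, a separator, then a
    # big-endian timestamp, mimicking a binary-sorted state store key.
    return dept.encode("utf-8") + b"\x00" + session_start.to_bytes(8, "big")

# Binary-sorted "store", as RocksDB would keep its keys.
store = sorted([
    encode_key("dept1", 100),
    encode_key("dept2", 50),
    encode_key("dept1", 200),
    encode_key("dept3", 10),
])

def prefix_scan(prefix: bytes):
    # Seek to the first key >= prefix, then take keys while they match.
    lo = bisect.bisect_left(store, prefix)
    return [k for k in store[lo:] if k.startswith(prefix)]

# Both session windows for dept1 are found by a single prefix scan.
print(prefix_scan(b"dept1\x00") == [encode_key("dept1", 100),
                                    encode_key("dept1", 200)])  # True
```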
Thanks, using UNICODE in the grouping key of a session window aggregation with the RocksDB state store provider should cover it.
Right, I updated both the PR title and PR description. And yes, tests for collations are still pretty ad hoc/selective. The goal of this PR is to assert that the basics work. As we create a more thorough plan for collations and streaming, we will start adding better-organized test strategies. Let me know if you think now is a good time to start with this. I also thought about creating a new test suite only for collations, but that seemed like overkill for this change.
Looks good from my side
+1
Thanks! Merging to master.
What changes were proposed in this pull request?
This change covers tests for streaming operations over string-typed columns collated with non-UTF8-binary collations. The PR introduces the following tests:
- Use the UTF8_BINARY_LCASE non-binary collation as the input and assert that streaming propagates the collation and that filtering behaves under the rules of the given collation.
- Use the UNICODE collation as the source and make sure that stateful operations (deduplication is taken as the example) work.
Why are the changes needed?
You can find more information about the collation effort in the document attached to the root Jira ticket.
This PR adds tests for basic non-stateful streaming operations with collations (e.g. filtering).
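The filtering behavior these tests assert can be modeled in plain Python (this is not the actual Spark test; `filter_eq_lcase` is an illustrative name, and `str.lower()` only approximates UTF8_BINARY_LCASE equality). An equality filter under a lowercase collation keeps both casings of the matched value:

```python
def filter_eq_lcase(rows, needle):
    # Stand-in for an equality filter under a case-insensitive
    # collation: compare lowercased forms instead of raw bytes.
    return [(s, n) for (s, n) in rows if s.lower() == needle.lower()]

# Same sample data as the test stream in the review discussion.
rows = [("aaa", 1), ("AAA", 2), ("bbb", 3), ("aa", 4)]
print(filter_eq_lcase(rows, "aaa"))  # [('aaa', 1), ('AAA', 2)]
```

Note that "aa" is excluded: the collation changes how strings compare, not which strings are prefixes of others.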
Does this PR introduce any user-facing change?
No
How was this patch tested?
This PR is test-only.
Was this patch authored or co-authored using generative AI tooling?
No