[SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE #46280

stefankandic · 2024-04-29T10:13:15Z

What changes were proposed in this pull request?

This PR changes the serialization and deserialization of collated strings so that the collation information is stored in the metadata of the enclosing struct field and read back from there during parsing.

The serialized format will look something like this:

{
  "type": "struct",
  "fields": [
    {
      "name": "colName",
      "type": "string",
      "nullable": true,
      "metadata": {
        "__COLLATIONS": {
          "colName": "UNICODE"
        }
      }
    }
  ]
}
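
As a rough illustration of the read path, the sketch below parses a schema of this shape with Python's standard json module and looks the collation up in the field metadata. The helper name collation_for and the constant COLLATIONS_METADATA_KEY are hypothetical and chosen for this example only; this is not the PySpark implementation.

```python
import json
from typing import Optional

# Assumed to match the metadata key used in the serialized format above.
COLLATIONS_METADATA_KEY = "__COLLATIONS"

schema_json = """
{
  "type": "struct",
  "fields": [
    {
      "name": "colName",
      "type": "string",
      "nullable": true,
      "metadata": {"__COLLATIONS": {"colName": "UNICODE"}}
    }
  ]
}
"""


def collation_for(field: dict, path: str) -> Optional[str]:
    """Hypothetical helper: return the collation recorded for `path`, or None."""
    collations = field.get("metadata", {}).get(COLLATIONS_METADATA_KEY, {})
    return collations.get(path)


field = json.loads(schema_json)["fields"][0]

# The declared type stays a plain "string"; the collation lives only in metadata.
print(field["type"])
print(collation_for(field, "colName"))
```

A reader that never looks at the metadata still sees an ordinary string column, which is the compatibility property this format aims for.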

For a map, the suffixes .key and .value are appended to the field name in the metadata:

{
  "type": "struct",
  "fields": [
    {
      "name": "mapField",
      "type": {
        "type": "map",
        "keyType": "string",
        "valueType": "string",
        "valueContainsNull": true
      },
      "nullable": true,
      "metadata": {
        "__COLLATIONS": {
          "mapField.key": "UNICODE",
          "mapField.value": "UNICODE"
        }
      }
    }
  ]
}

Arrays work similarly, using the .element suffix. Deeply nested data types can produce several suffixes in one path (for example Map[String, Array[Array[String]]]; see the tests for this example).
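
The suffix scheme can be sketched as a small recursive walk over the JSON form of a field's type. This is only an illustration under stated assumptions: string_leaf_paths is a hypothetical helper, types are represented the way the schema JSON above represents them, and the walk deliberately stops at nested structs on the assumption that their fields carry their own metadata.

```python
from typing import Iterator, Union

JsonType = Union[str, dict]


def string_leaf_paths(data_type: JsonType, path: str) -> Iterator[str]:
    """Hypothetical helper: yield the metadata path of every string nested in data_type."""
    if data_type == "string":
        yield path
    elif isinstance(data_type, dict):
        kind = data_type.get("type")
        if kind == "array":
            yield from string_leaf_paths(data_type["elementType"], path + ".element")
        elif kind == "map":
            yield from string_leaf_paths(data_type["keyType"], path + ".key")
            yield from string_leaf_paths(data_type["valueType"], path + ".value")
        # Other kinds (for example nested structs) are not descended into here.


# JSON form of Map[String, Array[Array[String]]]
nested_map = {
    "type": "map",
    "keyType": "string",
    "valueType": {
        "type": "array",
        "elementType": {
            "type": "array",
            "elementType": "string",
            "containsNull": True,
        },
        "containsNull": True,
    },
    "valueContainsNull": True,
}

paths = list(string_leaf_paths(nested_map, "mapField"))
print(paths)
```

For the nested map, the walk produces one path for the key and one path with two .element suffixes for the innermost array element.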

Why are the changes needed?

Putting collation info in field metadata is the only way to avoid breaking old clients that read new tables with collations. CharVarcharUtils does a similar thing, but this approach is much less hacky and friendlier to all third-party clients, which is especially important since Delta also uses Spark for schema ser/de.

It will also remove the need for additional logic introduced in #46083 to remove collations before writing to HMS as this way the tables will be fully HMS compatible.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

With unit tests

Was this patch authored or co-authored using generative AI tooling?

No

stefankandic · 2024-05-08T09:34:57Z

@cloud-fan please take a look when you have the time

olaky · 2024-05-15T07:34:32Z

python/pyspark/sql/tests/test_types.py

+ "nullable": true,
+ "metadata": {{
+ "{_COLLATIONS_METADATA_KEY}": {{
+ "mapField.value": "icu.UNICODE"


What about duplicate keys in this JSON object? (That should be a protocol error.)

We talked about this one a bit offline, but I would rather tackle this as a separate issue than treat it as just a collation protocol error. Currently, neither the Python nor the Scala code fails when encountering duplicate keys: Python will just pick one to put in the dictionary, and Scala will have both in the JObject. What do you think, @cloud-fan?
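
The Python side of this behavior is easy to reproduce with the standard json module, which keeps the last value for a repeated key. The sketch below also shows one possible way to turn duplicates into an error through object_pairs_hook; reject_duplicate_keys is a hypothetical helper for illustration and is not something this PR adds.

```python
import json


def reject_duplicate_keys(pairs):
    """Hypothetical hook: build a dict from (key, value) pairs, failing on repeated keys."""
    result = {}
    for key, value in pairs:
        if key in result:
            raise ValueError(f"duplicate key in JSON object: {key!r}")
        result[key] = value
    return result


doc = '{"mapField.value": "UNICODE", "mapField.value": "UTF8_BINARY"}'

# Default behavior: no error, the later value silently replaces the earlier one.
lenient = json.loads(doc)
print(lenient)

# Strict behavior: the hook sees every pair of each object and can reject repeats.
try:
    json.loads(doc, object_pairs_hook=reject_duplicate_keys)
except ValueError as err:
    print(err)
```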

olaky

LGTM

stefankandic · 2024-05-17T11:51:21Z

@cloud-fan all checks passing, can we merge this?

cloud-fan · 2024-05-18T07:17:27Z

thanks, merging to master!

stefankandic · 2024-05-20T15:34:54Z

@cloud-fan I looked into the HMS code a bit, and it seems that we can't save StructField metadata there. So I guess we will still have to keep converting a schema with collations to a schema without them when creating a table in Hive, even though collations are no longer part of the type?

cloud-fan · 2024-05-20T23:19:55Z

I think so. String type with collation should be normal string type in the Hive table schema, so that other engines can still read it. We only keep the collation info in the Spark-specific table schema JSON string in table properties.

stefankandic added 4 commits April 27, 2024 04:34

initial impl of new delta schema

8edc5ea

some improvements

2f5a1b8

formatting improvements

8f6856b

minor changes

af41b50

github-actions bot added the SQL label Apr 29, 2024

stefankandic changed the title ~~[DRAFT] New delta schema~~ [DRAFT] Store collation information in metadata and not in type for SER/DE Apr 29, 2024

small refactoring

001b9e9

olaky reviewed Apr 29, 2024

sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala (outdated)

stefankandic added 6 commits April 29, 2024 13:34

remove bogus comment

e65285b

remove bogus comment

77e4103

fix scalastyle

e4a205b

use string as path instead of list of strings

bd600da

add tests for schema ser/de

19f5588

add python impl and tests

3a7e8f6

github-actions bot added the PYTHON label May 7, 2024

stefankandic added 2 commits May 7, 2024 17:02

merge with latest master

79e5aaa

reformat python

6ebaddf

stefankandic changed the title ~~[DRAFT] Store collation information in metadata and not in type for SER/DE~~ [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE May 7, 2024

stefankandic added 4 commits May 7, 2024 17:50

update method calls

b488346

tests passing

7747947

fix equality for struct fields

0315421

rename variables

3b9ad23

stefankandic marked this pull request as ready for review May 8, 2024 09:09

stefankandic requested a review from olaky May 8, 2024 09:34

stefankandic mentioned this pull request May 8, 2024

Protocol RFC for collations delta-io/delta#3068

stefankandic added 3 commits May 8, 2024 13:31

fix python tests

913095c

revert changes to collationsuite

f48aa27

fix pyspark-connect and streaming test

9283b76

github-actions bot added the STRUCTURED STREAMING label May 8, 2024

cstavr approved these changes May 13, 2024

cstavr reviewed May 13, 2024

common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java (outdated)

stefankandic added 2 commits May 14, 2024 18:21

add missing docstring and fix mypy error

2b0a799

fix condition for valid types

aff7b58

olaky reviewed May 15, 2024

stefankandic added 2 commits May 15, 2024 16:29

respond to pr comments

c35d781

add more tests

6e379aa

stefankandic requested a review from olaky May 15, 2024 16:22

stefankandic added 4 commits May 16, 2024 10:24

make provider lowercase

0879841

fix failing test

1ae29a8

fix mypy

fdb0a7e

fix mypy

c123a4b