New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE #46280
Conversation
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala
Outdated
Show resolved
Hide resolved
@cloud-fan please take a look when you have the time |
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Outdated
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Outdated
Show resolved
Hide resolved
"nullable": true, | ||
"metadata": {{ | ||
"{_COLLATIONS_METADATA_KEY}": {{ | ||
"mapField.value": "icu.UNICODE" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about duplicate keys in this json object (should be a protocol error)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talked about this one a bit offline, but I would rather tackle this as a separate issue than just a collation protocol error. Currently, both python and scala code will not fail when encountering duplicate keys; python will just pick one to put in the dictionary and scala will have both in the JObject
. What do you think @cloud-fan ?
sql/api/src/main/scala/org/apache/spark/sql/types/StringType.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/types/StructTypeSuite.scala
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Outdated
Show resolved
Hide resolved
sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala
Outdated
Show resolved
Hide resolved
sql/api/src/main/scala/org/apache/spark/sql/types/StructField.scala
Outdated
Show resolved
Hide resolved
sql/catalyst/src/test/scala/org/apache/spark/sql/types/DataTypeSuite.scala
Show resolved
Hide resolved
common/unsafe/src/main/java/org/apache/spark/sql/catalyst/util/CollationFactory.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@cloud-fan all checks passing, can we merge this? |
thanks, merging to master! |
@cloud-fan I looked into HMS code a bit, and it seems that we can't save StructField metadata there, so I guess we will still have to keep converting schema with collation to schema without when creating a table in hive even though collations are no longer a type? |
I think so. String type with collation should be normal string type in the Hive table schema, so that other engines can still read it. We only keep the collation info in the Spark-specific table schema JSON string in table properties. |
What changes were proposed in this pull request?
Changing serialization and deserialization of collated strings so that the collation information is put in the metadata of the enclosing struct field - and then read back from there during parsing.
Format of serialization will look something like this:
If we have a map we will add suffixes
.key
and.value
in the metadata:It will be a similar story for arrays (we will add
.element
suffix). We could have multiple suffixes when working with deeply nested data types (Map[String, Array[Array[String]]] - see tests for this example)Why are the changes needed?
Putting collation info in field metadata is the only way to not break old clients reading new tables with collations.
CharVarcharUtils
does a similar thing but this is much less hacky, and more friendly for all 3p clients - which is especially important since delta also uses spark for schema ser/de.It will also remove the need for additional logic introduced in #46083 to remove collations before writing to HMS as this way the tables will be fully HMS compatible.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
With unit tests
Was this patch authored or co-authored using generative AI tooling?
No