Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE #46280

Closed
wants to merge 46 commits into from

Conversation

stefankandic
Copy link
Contributor

@stefankandic stefankandic commented Apr 29, 2024

What changes were proposed in this pull request?

Changing serialization and deserialization of collated strings so that the collation information is put in the metadata of the enclosing struct field - and then read back from there during parsing.

Format of serialization will look something like this:

{
  "type": "struct",
  "fields": [
    "name": "colName",
    "type": "string",
    "nullable": true,
    "metadata": {
      "__COLLATIONS": {
        "colName": "UNICODE"
      }
    }
  ]
}

If we have a map we will add suffixes .key and .value in the metadata:

{
  "type": "struct",
  "fields": [
    {
      "name": "mapField",
      "type": {
        "type": "map",
        "keyType": "string",
        "valueType": "string",
        "valueContainsNull": true
      },
      "nullable": true,
      "metadata": {
        "__COLLATIONS": {
          "mapField.key": "UNICODE",
          "mapField.value": "UNICODE"
        }
      }
    }
  ]
}

It will be a similar story for arrays (we will add .element suffix). We could have multiple suffixes when working with deeply nested data types (Map[String, Array[Array[String]]] - see tests for this example)

Why are the changes needed?

Putting collation info in field metadata is the only way to not break old clients reading new tables with collations. CharVarcharUtils does a similar thing but this is much less hacky, and more friendly for all 3p clients - which is especially important since delta also uses spark for schema ser/de.

It will also remove the need for additional logic introduced in #46083 to remove collations before writing to HMS as this way the tables will be fully HMS compatible.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

With unit tests

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Apr 29, 2024
@stefankandic stefankandic changed the title [DRAFT] New delta schema [DRAFT] Store collation information in metadata and not in type for SER/DE Apr 29, 2024
@github-actions github-actions bot added the PYTHON label May 7, 2024
@stefankandic stefankandic changed the title [DRAFT] Store collation information in metadata and not in type for SER/DE [SPARK-48175][SQL][PYTHON] Store collation information in metadata and not in type for SER/DE May 7, 2024
@stefankandic stefankandic marked this pull request as ready for review May 8, 2024 09:09
@stefankandic stefankandic requested a review from olaky May 8, 2024 09:34
@stefankandic
Copy link
Contributor Author

@cloud-fan please take a look when you have the time

python/pyspark/sql/tests/test_types.py Outdated Show resolved Hide resolved
python/pyspark/sql/tests/test_types.py Show resolved Hide resolved
"nullable": true,
"metadata": {{
"{_COLLATIONS_METADATA_KEY}": {{
"mapField.value": "icu.UNICODE"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about duplicate keys in this json object (should be a protocol error)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this one a bit offline, but I would rather tackle this as a separate issue than just a collation protocol error. Currently, both python and scala code will not fail when encountering duplicate keys; python will just pick one to put in the dictionary and scala will have both in the JObject. What do you think @cloud-fan ?

@stefankandic stefankandic requested a review from olaky May 15, 2024 16:22
Copy link
Contributor

@olaky olaky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@stefankandic
Copy link
Contributor Author

@cloud-fan all checks passing, can we merge this?

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 6f6b486 May 18, 2024
@stefankandic
Copy link
Contributor Author

stefankandic commented May 20, 2024

@cloud-fan I looked into HMS code a bit, and it seems that we can't save StructField metadata there, so I guess we will still have to keep converting schema with collation to schema without when creating a table in hive even though collations are no longer a type?

@cloud-fan
Copy link
Contributor

I think so. String type with collation should be normal string type in the Hive table schema, so that other engines can still read it. We only keep the collation info in the Spark-specific table schema JSON string in table properties.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants