[Spark] Allow type widening for all supported type changes with Spark 4.0 #3024
Conversation
spark/src/test/scala/org/apache/spark/sql/delta/DeltaTypeWideningSuite.scala (outdated review thread, resolved)
Will there be an option in the future to change a column's type from int to string without rewriting the entire table? Unless such an option is already available (but I don't remember one).
There's currently no plan to support type changes other than the ones mentioned in the PR description. Converting values when reading from a table that had one of these widening type changes applied can easily be done directly in the Parquet reader, but other type changes are harder, either because:
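To illustrate the distinction, a minimal sketch (assuming a spark-shell session with spark in scope; the table name and the delta.enableTypeWidening property are illustrative assumptions, not taken from this PR) of a widening change that the Parquet reader can absorb, versus the int -> string change asked about above:

```scala
// Sketch only: table name and the delta.enableTypeWidening property are
// illustrative assumptions, not taken from this PR.
spark.sql("CREATE TABLE t (id INT) USING delta")
spark.sql("ALTER TABLE t SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')")
spark.sql("INSERT INTO t VALUES (1), (2)")

// Widening change: existing Parquet files keep their INT32 values and the
// reader converts them to long on read, so no data files are rewritten.
spark.sql("ALTER TABLE t CHANGE COLUMN id TYPE BIGINT")

// int -> string is not a widening change and is not supported; it would
// require rewriting every existing data file.
// spark.sql("ALTER TABLE t CHANGE COLUMN id TYPE STRING")
```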
(force-pushed from 66006ff to 1cf90d3)
(force-pushed from 6f6825a to cce8b48)
looks good, left some comments
/**
 * Type widening only supports a limited set of type changes with Spark 3.5 due to the parquet
 * readers lacking the corresponding conversions that were added in Spark 4.0.
Not sure about the mechanics, but shouldn't this go in a scala-spark-4.0 directory instead of master? What happens when 4.0 is cut/released?
Spark 4.0 isn't cut yet, so that's not possible; the build system currently only knows master and latest (3.5). I imagine once Spark 4.0 is cut, the scala-spark-master folder will be copied over to scala-spark-4.0.
case (ByteType | ShortType, IntegerType) => true
case _ => false
}
TypeWideningShims.isTypeChangeSupported(fromType, toType)
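For readers without the full diff, a rough sketch of what the Spark 3.5 variant of this shim could look like (the parameter types and surrounding layout are assumptions; only the method name comes from the diff above):

```scala
// Hypothetical sketch of the Spark 3.5 shim; the real shim lives in a
// version-specific shim source directory and its exact signature may differ.
import org.apache.spark.sql.types._

object TypeWideningShims {

  // With Spark 3.5 only byte -> short -> int is allowed; the Spark
  // master / 4.0 shim additionally allows the changes listed in the PR
  // description (int -> long, float -> double, date -> timestampNTZ, ...).
  def isTypeChangeSupported(fromType: DataType, toType: DataType): Boolean =
    (fromType, toType) match {
      case (ByteType, ShortType) => true
      case (ByteType | ShortType, IntegerType) => true
      case _ => false
    }
}
```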
For my education, how is the TypeWideningShims object visible here?
The build script accepts a sparkVersion argument that toggles between two different build targets, each pulling in its own set of shim files: https://github.com/delta-io/delta/blob/master/build.sbt#L163. TypeWideningShims is declared in the same package as TypeWidening, so it's visible without an explicit import.
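As a simplified illustration of that mechanism (setting and directory names are assumptions for this sketch; the actual logic is in the build.sbt linked above), the shim selection amounts to adding a version-specific source directory per build target:

```scala
// Simplified sbt sketch (build.sbt style); names are illustrative only.
val sparkVersion = settingKey[String]("Spark version to compile against")

lazy val spark = (project in file("spark"))
  .settings(
    sparkVersion := sys.props.getOrElse("sparkVersion", "latest"),
    // Pull in the shim sources for the selected target, e.g.
    // spark/src/main/scala-spark-3.5 or spark/src/main/scala-spark-master.
    Compile / unmanagedSourceDirectories += {
      val shimDir =
        if (sparkVersion.value == "master") "scala-spark-master"
        else "scala-spark-3.5"
      baseDirectory.value / "src" / "main" / shimDir
    }
  )
```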
lgtm
This PR adds shims to ungate the remaining type changes that only work with Spark 4.0 / master. Spark 4.0 contains the required changes to Parquet readers to be able to read the data after applying the type changes.
Description
Extend the list of supported type changes for type widening to include changes that can be supported with Spark 4.0:
- (byte, short, int) -> long
- float -> double
- date -> timestampNTZ
- (byte, short, int) -> double
- decimal -> decimal (with increased precision/scale that doesn't cause precision loss)
- (byte, short, int, long) -> decimal
Shims are added to support these changes when compiling against Spark 4.0/master and to only allow byte -> short -> int when compiling against Spark 3.5.
How was this patch tested?
Adding test cases for the new type changes in the existing type widening test suites. The list of supported / unsupported changes covered in tests differs between Spark 3.5 and Spark 4.0, shims are also provided to handle this.
Does this PR introduce any user-facing changes?
Yes: allow using the listed type changes with type widening, either via ALTER TABLE CHANGE COLUMN TYPE or during schema evolution in MERGE and INSERT.
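As a rough illustration of the schema evolution path (a sketch only: table and column names are hypothetical, and the delta.enableTypeWidening table property is an assumption about how the feature is enabled on the table), one of the newly allowed changes, float -> double, applied during an append:

```scala
// Sketch only: assumes a spark-shell session (`spark` in scope), a Delta
// table with a FLOAT column, and type widening enabled on that table via
// the (assumed) delta.enableTypeWidening property.
import spark.implicits._

spark.sql("CREATE TABLE scores (value FLOAT) USING delta")
spark.sql("ALTER TABLE scores SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')")

// Appending doubles into the float column with schema merging enabled is
// expected to widen the column to DOUBLE instead of failing the write.
Seq(1.5d, 2.5d).toDF("value")
  .write.format("delta")
  .option("mergeSchema", "true")
  .mode("append")
  .saveAsTable("scores")

spark.sql("DESCRIBE TABLE scores").show()
```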