
[SPARK-47681][SQL][FOLLOWUP] Fix variant decimal handling. #46338


@chenhao-db (Contributor) commented on May 2, 2024:

What changes were proposed in this pull request?

There are two issues with the current variant decimal handling:

  1. The precision and scale of the `BigDecimal` returned by `getDecimal` are not checked. Based on the variant spec, they must be within the corresponding limits for DECIMAL4/8/16. An out-of-range decimal can lead to failures in downstream Spark operations (a minimal illustration follows below).
  2. The current schema_of_variant implementation doesn't correctly handle the case where precision is smaller than scale. Spark's DecimalType requires precision >= scale.

The Python side requires a similar fix for issue 1. While making that fix, I found that the Python error reporting was not correctly implemented (it was never tested either) and fixed it as well.
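To make the first issue concrete, here is a minimal, standalone sketch of the kind of bound check involved; it is not the PR's actual code, and the limits of 9, 18, and 38 digits for DECIMAL4/8/16 are the maximum precisions given by the variant spec.

```java
import java.math.BigDecimal;

// Standalone sketch: a decimal decoded from a DECIMAL4/8/16 primitive must keep both
// its precision and its scale within the maximum precision of that physical width.
public class DecimalBoundSketch {
  static void checkBounds(BigDecimal d, int maxPrecision) {
    if (d.precision() > maxPrecision || d.scale() > maxPrecision) {
      // In Spark this situation surfaces as a MALFORMED_VARIANT error.
      throw new IllegalArgumentException("decimal out of range for variant: " + d);
    }
  }

  public static void main(String[] args) {
    checkBounds(new BigDecimal("123456.789"), 9);  // ok: precision 9, scale 3 (DECIMAL4 limit)
    checkBounds(BigDecimal.valueOf(1L, 40), 38);   // throws: scale 40 exceeds the DECIMAL16 limit
  }
}
```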

Why are the changes needed?

They are bug fixes and are required to process decimals correctly.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the SQL label on May 2, 2024
@gene-db (Contributor) left a comment:

@chenhao-db Thanks! I had a question about the pyspark side.

    BigDecimal result;
    switch (typeInfo) {
      case DECIMAL4:
        // Decode the 4-byte little-endian unscaled value, then validate that the resulting
        // decimal is within the DECIMAL4 precision/scale limit.
        result = BigDecimal.valueOf(readLong(value, pos + 2, 4), scale);
        checkDecimal(result, MAX_DECIMAL4_PRECISION);
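The excerpt above cuts off after the DECIMAL4 case; the wider cases presumably follow the same read-then-check pattern with 8 and 16 little-endian bytes. A self-contained sketch of that pattern (illustrative only, not the PR's exact code):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class WideDecimalSketch {
  // Read `size` little-endian bytes starting at `pos` as a signed unscaled integer.
  static BigInteger readUnscaledLittleEndian(byte[] value, int pos, int size) {
    byte[] bigEndian = new byte[size];
    for (int i = 0; i < size; ++i) {
      bigEndian[i] = value[pos + size - 1 - i]; // reverse into the big-endian order BigInteger expects
    }
    return new BigInteger(bigEndian);
  }

  public static void main(String[] args) {
    // 16-byte little-endian encoding of the unscaled value 123; with scale 2 this is 1.23.
    byte[] bytes = new byte[16];
    bytes[0] = 123;
    BigDecimal d = new BigDecimal(readUnscaledLittleEndian(bytes, 0, 16), 2);
    System.out.println(d);                    // 1.23
    System.out.println(d.precision() <= 38);  // true: within the DECIMAL16 limit of 38 digits
  }
}
```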
@gene-db (Contributor) commented:

Does the pyspark side also need to be updated?

@chenhao-db (Author):

I think so.

@chenhao-db (Author):

The Python side is also fixed. During the fix, I found that the Python error reporting was not correctly implemented and fixed that as well.

@github-actions bot added the PYTHON label on May 2, 2024
@chenhao-db requested a review from gene-db on May 2, 2024 at 20:23
elif type_info == VariantUtils.DECIMAL16:
    cls._check_index(pos + 17, len(value))
    unscaled = int.from_bytes(value[pos + 2 : pos + 18], byteorder="little", signed=True)
    # 10**38 is the bound on the DECIMAL16 unscaled value; 38 is its maximum precision.
    cls._check_decimal(unscaled, scale, 100000000000000000000000000000000000000, 38)
A reviewer (Contributor) commented:

Let's make these literals into constants.

@chenhao-db (Author):

Done.

@gene-db (Contributor) left a comment:

Thanks for the fixes!

LGTM

@dongjoon-hyun changed the title from "[SPARK-47681][FOLLOWUP] Fix variant decimal handling." to "[SPARK-47681][SQL][FOLLOWUP] Fix variant decimal handling." on May 3, 2024

  // Get a decimal value from variant value `value[pos...]`.
  // Throw `MALFORMED_VARIANT` if the variant is malformed.
  public static BigDecimal getDecimal(byte[] value, int pos) {
    checkIndex(pos, value.length);
    int basicType = value[pos] & BASIC_TYPE_MASK;
    int typeInfo = (value[pos] >> BASIC_TYPE_BITS) & TYPE_INFO_MASK;
    if (basicType != PRIMITIVE) throw unexpectedType(Type.DECIMAL);
    // Interpret the scale byte as unsigned. If it is a negative byte, the unsigned value must be
    // greater than `MAX_DECIMAL16_PRECISION` and will trigger an error in `checkDecimal`.
    int scale = value[pos + 1] & 0xFF;
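As an aside (not part of the diff), the reason the scale byte has to be masked is that a Java `byte` is signed; a tiny demo:

```java
public class UnsignedScaleByteDemo {
  public static void main(String[] args) {
    byte rawScale = (byte) 0xFF;
    int signExtended = rawScale;      // -1: widening a byte to int sign-extends it
    int unsigned = rawScale & 0xFF;   // 255: masking keeps the unsigned interpretation
    System.out.println(signExtended + " vs " + unsigned);
    // 255 is far above MAX_DECIMAL16_PRECISION (38), so instead of silently producing a
    // negative scale, the value gets rejected by the subsequent precision/scale check.
    System.out.println(unsigned > 38); // true
  }
}
```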
@cloud-fan (Contributor) commented on May 5, 2024:

We should distinguish between the variant decimal spec and Spark's decimal limitations. Some databases (and Java libraries) support a negative decimal scale, which means the value range is that of decimal(precision, 0) multiplied by 10^(-scale). To read such a decimal, e.g. decimal(10, -5), we can turn it into the Spark decimal decimal(15, 0).
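To make the negative-scale point concrete, here is a small sketch with plain `java.math.BigDecimal` (an illustration of the mapping only; as the reply below notes, the variant spec itself does not allow negative scale):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class NegativeScaleDemo {
  public static void main(String[] args) {
    // A decimal(10, -5)-style value: unscaled 123 with scale -5 represents 12300000.
    BigDecimal d = new BigDecimal(BigInteger.valueOf(123), -5);
    System.out.println(d); // 1.23E+7

    // Rewriting it at scale 0 preserves the value but needs a wider precision, which is
    // how such a value could be represented as a Spark decimal like decimal(15, 0).
    BigDecimal rescaled = d.setScale(0);
    System.out.println(rescaled + " -> precision " + rescaled.precision()
        + ", scale " + rescaled.scale()); // 12300000 -> precision 8, scale 0
  }
}
```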

@chenhao-db (Author):

We don't allow negative scale in the variant spec either:

| decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |

  public static double getDouble(byte[] value, int pos) {
    // ... (unchanged)
    return Double.longBitsToDouble(readLong(value, pos + 1, 8));
  }

  // Check whether the precision and scale of the decimal are within the limit.
  private static void checkDecimal(BigDecimal d, int maxPrecision) {
    if (d.precision() > maxPrecision || d.scale() > maxPrecision) {
      // An out-of-range precision or scale means the variant is malformed.
      throw malformedVariant();
    }
  }
A reviewer (Contributor) commented:

Shall we also check precision > scale?

@chenhao-db (Author) commented on May 6, 2024:

This is where we do need to distinguish between the variant decimal spec and Spark's decimal limitations. precision >= scale is a requirement in Spark, not in the variant spec. In the new test case I added, 0.01 has a precision of 1 and a scale of 2. It is a valid decimal in the variant spec but requires some special handling in schema_of_variant.
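A quick check of the `0.01` example with plain `java.math.BigDecimal`; the `max(precision, scale)` adjustment at the end is just an illustration of the kind of special handling meant here, not necessarily the exact code in `schema_of_variant`:

```java
import java.math.BigDecimal;

public class PrecisionLessThanScaleDemo {
  public static void main(String[] args) {
    BigDecimal d = new BigDecimal("0.01");
    // The unscaled value is 1, so the precision is 1 while the scale is 2: a valid
    // variant decimal, but Spark's DecimalType requires precision >= scale.
    System.out.println(d.precision() + ", " + d.scale()); // 1, 2

    // Widening the precision to at least the scale yields a legal Spark type for this value.
    int precision = Math.max(d.precision(), d.scale());
    System.out.println("DECIMAL(" + precision + "," + d.scale() + ")"); // DECIMAL(2,2)
  }
}
```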

@chenhao-db requested a review from cloud-fan on May 6, 2024 at 14:19

@cloud-fan (Contributor):

Thanks, merging to master!

@cloud-fan closed this in d67752a on May 7, 2024
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request on May 11, 2024 (its commit message repeats the PR description above).

Closes apache#46338 from chenhao-db/fix_variant_decimal.

Authored-by: Chenhao Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request May 13, 2024
### What changes were proposed in this pull request?

The PR #46338 found that `schema_of_variant` sometimes could not correctly handle variant decimals and included a fix. However, I found that the fix is incomplete and `schema_of_variant` can still fail on some inputs. The reason is that `VariantUtil.getDecimal` calls `stripTrailingZeros`. For an input decimal `10.00`, the resulting scale is -1 and the unscaled value is 1, but a negative decimal scale is not allowed by Spark. The correct approach is to construct a `Decimal` from the `BigDecimal` and read its precision and scale, as is done in `VariantGet`.
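The `stripTrailingZeros` behavior described above can be reproduced with plain `java.math.BigDecimal`:

```java
import java.math.BigDecimal;

public class StripTrailingZerosDemo {
  public static void main(String[] args) {
    BigDecimal stripped = new BigDecimal("10.00").stripTrailingZeros();
    // 10.00 becomes 1E+1: unscaled value 1 with scale -1. Spark does not allow a negative
    // decimal scale, so reading precision/scale straight off the stripped value is unsafe.
    System.out.println(stripped);                 // 1E+1
    System.out.println(stripped.unscaledValue()); // 1
    System.out.println(stripped.scale());         // -1
  }
}
```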

This PR also includes a minor change to `VariantGet`, where a duplicated expression was previously computed twice.

### Why are the changes needed?

They are bug fixes and are required to process decimals correctly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

More unit tests. Some of them would fail without the change in this PR (e.g., `check("10.00", "DECIMAL(2,0)")`). Others wouldn't fail but still improve test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46549 from chenhao-db/fix_decimal_schema.

Authored-by: Chenhao Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>