
[SPARK-47681][SQL][FOLLOWUP] Fix variant decimal handling. #46338


@chenhao-db (Contributor) commented on May 2, 2024:

What changes were proposed in this pull request?

There are two issues with the current variant decimal handling:

  1. The precision and scale of the `BigDecimal` returned by `getDecimal` are not checked. Based on the variant spec, they must be within the corresponding limits for DECIMAL4/8/16. An out-of-range decimal can lead to failures in downstream Spark operations (a minimal illustration follows below).
  2. The current schema_of_variant implementation doesn't correctly handle the case where precision is smaller than scale. Spark's DecimalType requires precision >= scale.

The Python side requires a similar fix for issue 1. While making that fix, I found that the Python error reporting was not correctly implemented (it was never tested either) and fixed it as well.
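To make the first issue concrete, here is a minimal, standalone sketch of the kind of bound check involved; it is not the PR's actual code, and the limits of 9, 18, and 38 digits for DECIMAL4/8/16 are the maximum precisions given by the variant spec.

```java
import java.math.BigDecimal;

// Standalone sketch: a decimal decoded from a DECIMAL4/8/16 primitive must keep both
// its precision and its scale within the maximum precision of that physical width.
public class DecimalBoundSketch {
  static void checkBounds(BigDecimal d, int maxPrecision) {
    if (d.precision() > maxPrecision || d.scale() > maxPrecision) {
      // In Spark this situation surfaces as a MALFORMED_VARIANT error.
      throw new IllegalArgumentException("decimal out of range for variant: " + d);
    }
  }

  public static void main(String[] args) {
    checkBounds(new BigDecimal("123456.789"), 9);  // ok: precision 9, scale 3 (DECIMAL4 limit)
    checkBounds(BigDecimal.valueOf(1L, 40), 38);   // throws: scale 40 exceeds the DECIMAL16 limit
  }
}
```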

Why are the changes needed?

They are bug fixes and are required to process decimals correctly.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions bot added the SQL label on May 2, 2024
@gene-db (Contributor) left a comment:

@chenhao-db Thanks! I had a question about the pyspark side.

    BigDecimal result;
    switch (typeInfo) {
      case DECIMAL4:
        // Decode the 4-byte little-endian unscaled value, then validate that the resulting
        // decimal is within the DECIMAL4 precision/scale limit.
        result = BigDecimal.valueOf(readLong(value, pos + 2, 4), scale);
        checkDecimal(result, MAX_DECIMAL4_PRECISION);
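The excerpt above cuts off after the DECIMAL4 case; the wider cases presumably follow the same read-then-check pattern with 8 and 16 little-endian bytes. A self-contained sketch of that pattern (illustrative only, not the PR's exact code):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class WideDecimalSketch {
  // Read `size` little-endian bytes starting at `pos` as a signed unscaled integer.
  static BigInteger readUnscaledLittleEndian(byte[] value, int pos, int size) {
    byte[] bigEndian = new byte[size];
    for (int i = 0; i < size; ++i) {
      bigEndian[i] = value[pos + size - 1 - i]; // reverse into the big-endian order BigInteger expects
    }
    return new BigInteger(bigEndian);
  }

  public static void main(String[] args) {
    // 16-byte little-endian encoding of the unscaled value 123; with scale 2 this is 1.23.
    byte[] bytes = new byte[16];
    bytes[0] = 123;
    BigDecimal d = new BigDecimal(readUnscaledLittleEndian(bytes, 0, 16), 2);
    System.out.println(d);                    // 1.23
    System.out.println(d.precision() <= 38);  // true: within the DECIMAL16 limit of 38 digits
  }
}
```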
@gene-db (Contributor) commented:

Does the pyspark side also need to be updated?

@chenhao-db (Author):

I think so.

@chenhao-db (Author):

The Python side is also fixed. During the fix, I found that the Python error reporting was not correctly implemented and fixed that as well.

@github-actions bot added the PYTHON label on May 2, 2024
@chenhao-db requested a review from gene-db on May 2, 2024 at 20:23
elif type_info == VariantUtils.DECIMAL16:
    cls._check_index(pos + 17, len(value))
    unscaled = int.from_bytes(value[pos + 2 : pos + 18], byteorder="little", signed=True)
    # 10**38 is the bound on the DECIMAL16 unscaled value; 38 is its maximum precision.
    cls._check_decimal(unscaled, scale, 100000000000000000000000000000000000000, 38)
A reviewer (Contributor) commented:

Let's make these literals into constants.

@chenhao-db (Author):

Done.

@gene-db (Contributor) left a comment:

Thanks for the fixes!

LGTM

@dongjoon-hyun changed the title from "[SPARK-47681][FOLLOWUP] Fix variant decimal handling." to "[SPARK-47681][SQL][FOLLOWUP] Fix variant decimal handling." on May 3, 2024

  // Get a decimal value from variant value `value[pos...]`.
  // Throw `MALFORMED_VARIANT` if the variant is malformed.
  public static BigDecimal getDecimal(byte[] value, int pos) {
    checkIndex(pos, value.length);
    int basicType = value[pos] & BASIC_TYPE_MASK;
    int typeInfo = (value[pos] >> BASIC_TYPE_BITS) & TYPE_INFO_MASK;
    if (basicType != PRIMITIVE) throw unexpectedType(Type.DECIMAL);
    // Interpret the scale byte as unsigned. If it is a negative byte, the unsigned value must be
    // greater than `MAX_DECIMAL16_PRECISION` and will trigger an error in `checkDecimal`.
    int scale = value[pos + 1] & 0xFF;
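As an aside (not part of the diff), the reason the scale byte has to be masked is that a Java `byte` is signed; a tiny demo:

```java
public class UnsignedScaleByteDemo {
  public static void main(String[] args) {
    byte rawScale = (byte) 0xFF;
    int signExtended = rawScale;      // -1: widening a byte to int sign-extends it
    int unsigned = rawScale & 0xFF;   // 255: masking keeps the unsigned interpretation
    System.out.println(signExtended + " vs " + unsigned);
    // 255 is far above MAX_DECIMAL16_PRECISION (38), so instead of silently producing a
    // negative scale, the value gets rejected by the subsequent precision/scale check.
    System.out.println(unsigned > 38); // true
  }
}
```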
@cloud-fan (Contributor) commented on May 5, 2024:

We should distinguish between the variant decimal spec and Spark's decimal limitations. Some databases (and Java libraries) support a negative decimal scale, which means the value range is that of decimal(precision, 0) multiplied by 10^(-scale). To read such a decimal, e.g. decimal(10, -5), we can turn it into the Spark decimal decimal(15, 0).
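To make the negative-scale point concrete, here is a small sketch with plain `java.math.BigDecimal` (an illustration of the mapping only; as the reply below notes, the variant spec itself does not allow negative scale):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class NegativeScaleDemo {
  public static void main(String[] args) {
    // A decimal(10, -5)-style value: unscaled 123 with scale -5 represents 12300000.
    BigDecimal d = new BigDecimal(BigInteger.valueOf(123), -5);
    System.out.println(d); // 1.23E+7

    // Rewriting it at scale 0 preserves the value but needs a wider precision, which is
    // how such a value could be represented as a Spark decimal like decimal(15, 0).
    BigDecimal rescaled = d.setScale(0);
    System.out.println(rescaled + " -> precision " + rescaled.precision()
        + ", scale " + rescaled.scale()); // 12300000 -> precision 8, scale 0
  }
}
```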

@chenhao-db (Author):

We don't allow negative scale in the variant spec either:

| decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |

  public static double getDouble(byte[] value, int pos) {
    // ... (unchanged)
    return Double.longBitsToDouble(readLong(value, pos + 1, 8));
  }

  // Check whether the precision and scale of the decimal are within the limit.
  private static void checkDecimal(BigDecimal d, int maxPrecision) {
    if (d.precision() > maxPrecision || d.scale() > maxPrecision) {
      // An out-of-range precision or scale means the variant is malformed.
      throw malformedVariant();
    }
  }
A reviewer (Contributor) commented:

Shall we also check precision > scale?

@chenhao-db (Author) commented on May 6, 2024:

This is where we do need to distinguish between the variant decimal spec and Spark's decimal limitations. precision >= scale is a requirement in Spark, not in the variant spec. In the new test case I added, 0.01 has a precision of 1 and a scale of 2. It is a valid decimal in the variant spec but requires some special handling in schema_of_variant.
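A quick check of the `0.01` example with plain `java.math.BigDecimal`; the `max(precision, scale)` adjustment at the end is just an illustration of the kind of special handling meant here, not necessarily the exact code in `schema_of_variant`:

```java
import java.math.BigDecimal;

public class PrecisionLessThanScaleDemo {
  public static void main(String[] args) {
    BigDecimal d = new BigDecimal("0.01");
    // The unscaled value is 1, so the precision is 1 while the scale is 2: a valid
    // variant decimal, but Spark's DecimalType requires precision >= scale.
    System.out.println(d.precision() + ", " + d.scale()); // 1, 2

    // Widening the precision to at least the scale yields a legal Spark type for this value.
    int precision = Math.max(d.precision(), d.scale());
    System.out.println("DECIMAL(" + precision + "," + d.scale() + ")"); // DECIMAL(2,2)
  }
}
```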

@chenhao-db requested a review from cloud-fan on May 6, 2024 at 14:19

@cloud-fan (Contributor):

Thanks, merging to master!

@cloud-fan closed this in d67752a on May 7, 2024
JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request on May 11, 2024 (its commit message repeats the PR description above).

Closes apache#46338 from chenhao-db/fix_variant_decimal.

Authored-by: Chenhao Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request May 13, 2024
### What changes were proposed in this pull request?

The PR #46338 found that `schema_of_variant` sometimes could not correctly handle variant decimals and included a fix. However, I found that the fix is incomplete and `schema_of_variant` can still fail on some inputs. The reason is that `VariantUtil.getDecimal` calls `stripTrailingZeros`. For an input decimal `10.00`, the resulting scale is -1 and the unscaled value is 1, but a negative decimal scale is not allowed by Spark. The correct approach is to construct a `Decimal` from the `BigDecimal` and read its precision and scale, as is done in `VariantGet`.
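The `stripTrailingZeros` behavior described above can be reproduced with plain `java.math.BigDecimal`:

```java
import java.math.BigDecimal;

public class StripTrailingZerosDemo {
  public static void main(String[] args) {
    BigDecimal stripped = new BigDecimal("10.00").stripTrailingZeros();
    // 10.00 becomes 1E+1: unscaled value 1 with scale -1. Spark does not allow a negative
    // decimal scale, so reading precision/scale straight off the stripped value is unsafe.
    System.out.println(stripped);                 // 1E+1
    System.out.println(stripped.unscaledValue()); // 1
    System.out.println(stripped.scale());         // -1
  }
}
```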

This PR also includes a minor change to `VariantGet`, where a duplicated expression was previously computed twice.

### Why are the changes needed?

They are bug fixes and are required to process decimals correctly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

More unit tests. Some of them would fail without the change in this PR (e.g., `check("10.00", "DECIMAL(2,0)")`). Others wouldn't fail but still improve test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46549 from chenhao-db/fix_decimal_schema.

Authored-by: Chenhao Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>