[SPARK-47681][SQL][FOLLOWUP] Fix variant decimal handling. #46338
Conversation
@chenhao-db Thanks! I had a question about the pyspark side.
```java
BigDecimal result;
switch (typeInfo) {
  case DECIMAL4:
    result = BigDecimal.valueOf(readLong(value, pos + 2, 4), scale);
    checkDecimal(result, MAX_DECIMAL4_PRECISION);
```
Does the pyspark side also need to be updated?
I think so.
The Python side is also fixed. While working on it, I found that the Python error reporting was not correctly implemented, and fixed that as well.
python/pyspark/sql/variant_utils.py
```python
elif type_info == VariantUtils.DECIMAL16:
    cls._check_index(pos + 17, len(value))
    unscaled = int.from_bytes(value[pos + 2 : pos + 18], byteorder="little", signed=True)
    cls._check_decimal(unscaled, scale, 100000000000000000000000000000000000000, 38)
```
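The kind of range check `_check_decimal` performs can be sketched as follows. This is a hypothetical standalone helper for illustration, not PySpark's actual implementation: the names `check_decimal`, `MAX_DECIMAL16_VALUE`, and `MAX_DECIMAL16_PRECISION` are assumptions, and the real code reports a Spark-specific malformed-variant error.

```python
from decimal import Decimal

# Hypothetical sketch: reject decimals whose unscaled value or scale exceed
# the limits of the physical type (DECIMAL16 here: at most 38 digits of
# unscaled value, scale in [0, 38]).
MAX_DECIMAL16_PRECISION = 38
MAX_DECIMAL16_VALUE = 10**38  # |unscaled| must be strictly below 10^38

def check_decimal(unscaled: int, scale: int, max_unscaled: int, max_scale: int) -> Decimal:
    """Validate a variant decimal and return it as a Python Decimal."""
    if not (-max_unscaled < unscaled < max_unscaled) or not (0 <= scale <= max_scale):
        raise ValueError("MALFORMED_VARIANT")
    # unscaled * 10^(-scale)
    return Decimal(unscaled).scaleb(-scale)

print(check_decimal(123, 2, MAX_DECIMAL16_VALUE, MAX_DECIMAL16_PRECISION))  # 1.23
```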
Let's make these literals into constants.
Done.
Thanks for the fixes!
LGTM
```java
// Get a decimal value from variant value `value[pos...]`.
// Throw `MALFORMED_VARIANT` if the variant is malformed.
public static BigDecimal getDecimal(byte[] value, int pos) {
  checkIndex(pos, value.length);
  int basicType = value[pos] & BASIC_TYPE_MASK;
  int typeInfo = (value[pos] >> BASIC_TYPE_BITS) & TYPE_INFO_MASK;
  if (basicType != PRIMITIVE) throw unexpectedType(Type.DECIMAL);
  int scale = value[pos + 1];
  // Interpret the scale byte as unsigned. If it is a negative byte, the unsigned value must be
  // greater than `MAX_DECIMAL16_PRECISION` and will trigger an error in `checkDecimal`.
```
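The signed-vs-unsigned point above can be illustrated outside of Java (a small Python demonstration, not Spark's code): a scale byte of `0xFF` reads as -1 when interpreted signed, but as 255 when interpreted unsigned, and 255 is well above `MAX_DECIMAL16_PRECISION` (38), so the unsigned interpretation lets the range check reject the malformed input.

```python
# Why reading the scale byte as unsigned is the safe choice: the same byte
# yields -1 (signed) or 255 (unsigned). 255 > 38, so an unsigned read makes
# the out-of-range scale trip the precision check instead of slipping
# through as a "valid" negative scale.
scale_byte = b"\xff"
signed = int.from_bytes(scale_byte, "little", signed=True)     # -1
unsigned = int.from_bytes(scale_byte, "little", signed=False)  # 255
MAX_DECIMAL16_PRECISION = 38
print(signed, unsigned, unsigned > MAX_DECIMAL16_PRECISION)
```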
We should distinguish between the variant decimal spec and Spark decimal limitations. Some databases (and Java libraries) support negative decimal scale, which means the value range of `decimal(precision, 0) * (10^-scale)`. To read such decimals, e.g. `decimal(10, -5)`, we can turn them into the Spark decimal `decimal(15, 0)`.
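The widening described above can be demonstrated with Python's `decimal` module (an analogue of Java's `BigDecimal`; this is an illustration of the arithmetic, not Spark code): a `decimal(10, -5)` value has up to 10 significant digits multiplied by 10^5, so rescaling it to scale 0 needs at most 10 + 5 = 15 digits.

```python
from decimal import Decimal

# Largest decimal(10, -5) value: 10 nines, scaled up by 10^5.
d = Decimal(9999999999).scaleb(5)
# Rescale to scale 0, as the Spark-side conversion would.
rescaled = d.quantize(Decimal(1))
digits = len(rescaled.as_tuple().digits)
print(rescaled, digits)  # fits in decimal(15, 0)
```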
We don't allow negative scale in the variant spec either. From `spark/common/variant/README.md` (line 348 at `bf2e254`):

| decimal4 | `8` | DECIMAL(precision, scale) | 1 byte scale in range [0, 38], followed by little-endian unscaled value (see decimal table) |
```diff
@@ -392,21 +392,32 @@ public static double getDouble(byte[] value, int pos) {
     return Double.longBitsToDouble(readLong(value, pos + 1, 8));
   }

   // Check whether the precision and scale of the decimal are within the limit.
   private static void checkDecimal(BigDecimal d, int maxPrecision) {
     if (d.precision() > maxPrecision || d.scale() > maxPrecision) {
```
Shall we also check `precision > scale`?
This is where we do need to distinguish between the variant decimal spec and Spark decimal limitations. `precision >= scale` is a requirement in Spark, not in the variant spec. In the new test case I add, `0.01` has a precision of 1 and a scale of 2. It is a valid decimal in the variant spec but requires some special handling in `schema_of_variant`.
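The `0.01` case can be checked with Python's `decimal` module, which uses the same precision/scale model as `BigDecimal` (the widening to `max(precision, scale)` below is a hedged reading of the fix, not Spark's exact code):

```python
from decimal import Decimal

# 0.01 has unscaled value 1 (one significant digit) and exponent -2:
# precision 1, scale 2 -- valid per the variant spec, but Spark's
# DecimalType requires precision >= scale.
t = Decimal("0.01").as_tuple()
precision = len(t.digits)  # 1
scale = -t.exponent        # 2
# Assumed widening for schema_of_variant: report DECIMAL(2, 2).
spark_precision = max(precision, scale)
print(precision, scale, spark_precision)
```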
Thanks, merging to master!
### What changes were proposed in this pull request?

There are two issues with the current variant decimal handling:

1. The precision and scale of the `BigDecimal` returned by `getDecimal` is not checked. Based on the variant spec, they must be within the corresponding limit for DECIMAL4/8/16. An out-of-range decimal can lead to failure in the downstream Spark operations.
2. The current `schema_of_variant` implementation doesn't correctly handle the case where precision is smaller than scale. Spark's `DecimalType` requires `precision >= scale`.

The Python side requires a similar fix for 1. During the fix, I found that Python error reporting was not correctly implemented (it was never tested either) and I also fixed it.

### Why are the changes needed?

They are bug fixes and are required to process decimals correctly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46338 from chenhao-db/fix_variant_decimal.

Authored-by: Chenhao Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?

The PR #46338 found `schema_of_variant` sometimes could not correctly handle variant decimals and had a fix. However, I found that the fix is incomplete and `schema_of_variant` can still fail on some inputs. The reason is that `VariantUtil.getDecimal` calls `stripTrailingZeros`. For an input decimal `10.00`, the resulting scale is -1 and the unscaled value is 1. However, negative decimal scale is not allowed by Spark. The correct approach is to use the `BigDecimal` to construct a `Decimal` and read its precision and scale, as what we did in `VariantGet`.

This PR also includes a minor change for `VariantGet`, where a duplicated expression is computed twice.

### Why are the changes needed?

They are bug fixes and are required to process decimals correctly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

More unit tests. Some of them would fail without the change in this PR (e.g., `check("10.00", "DECIMAL(2,0)")`). Others wouldn't fail, but can still enhance test coverage.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #46549 from chenhao-db/fix_decimal_schema.

Authored-by: Chenhao Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
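The `stripTrailingZeros` pitfall described above can be reproduced with Python's `decimal` module, whose `normalize()` behaves like `BigDecimal.stripTrailingZeros` (an illustrative analogue, not the Spark code itself):

```python
from decimal import Decimal

# Normalizing 10.00 yields 1E+1: unscaled value 1, exponent +1, i.e. a
# *negative* scale of -1 in BigDecimal terms. Deriving precision/scale from
# the normalized value therefore produces a scale Spark cannot represent,
# which is why "10.00" broke the earlier fix.
n = Decimal("10.00").normalize()
t = n.as_tuple()
print(n, t.digits, t.exponent)  # 1E+1 (1,) 1
```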