-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support HiveHash in GPU partitioning #12192
base: branch-25.04
Are you sure you want to change the base?
Conversation
Signed-off-by: Firestarman <[email protected]>
Signed-off-by: Firestarman <[email protected]>
build |
1 similar comment
build |
CI run hit #12194 |
Yeah, so i tried a new run . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
build |
logInfo(s"Found hash function '$hashMode' from cpu hash partitioning.") | ||
} catch { | ||
case _: NoSuchMethodException => // not the customized spark distributions, noop | ||
logInfo("No hash function field is found in cpu hash partitioning.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This is path for vanilla Spark, but the log seems like something is wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will improve this in a follow-up.. Updated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
build |
Signed-off-by: Firestarman <[email protected]>
3f90be0
build |
Signed-off-by: Firestarman <[email protected]>
build |
1 similar comment
build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@revans2 could you help take a look on this? thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly nits, but I am a bit concerned about not getting the proper test coverage for this feature.
} | ||
logInfo(s"Found hash function '$hashMode' from CPU hash partitioning.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change this logging to debug? I'm not sure we need to output this most of the time.
logInfo(s"Found hash function '$hashMode' from CPU hash partitioning.") | ||
} catch { | ||
case _: NoSuchMethodException => // not the customized spark distributions, noop | ||
logInfo("Use murmur3 for GPU hash partitioning.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here. I would prefer to have this as debug logging.
val hfMeta = GpuOverrides.wrapExpr(hh, this.conf, None) | ||
hfMeta.tagForGpu() | ||
if (!hfMeta.canThisBeReplaced) { | ||
willNotWorkOnGpu(s"the hash function: ${hh.getClass.getSimpleName}" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only real concern with this patch is in the testing.
is where we test all of our hash partitioning code. It should still work fine if our hash partitioning code matches what the CPU is doing. My only concern is that the tests are written 100% with murmur3 hashing in mind and with the limitation that we don't support a hash key that is an Array with a Struct under it. So there are no tests for that. Where as the hive hash code does support it, but has a maximum nesting limit instead. At a minimum can we have a follow on issue to find a proper way to let the integration tests know what hashing mode is being used and add update the integration tests to have proper coverage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, on top of my mind, there may be two options: 1. introduce a test-only configuration allowing manually configure HiveHash as shuffle partition strategy; 2. cover the integration test with a custom Spark which allows HiveHash as shuffle partition.
(No issue for this)
This PR adds the support to allow GPU hash partitioning using the same hash function as the CPU one by trying to infer the type of hash function from the CPU hash partitioning.
This is desgined for some Spark distributions supporting to specify a hash function (e.g. one of
HiveHash
,Murmur3Hash
...) for hash partitioning by introducing a new field named "hashingFunctionClass". It is not like the regular Spark who always useMurmur3Hash
. However, to align with the regular Spark's behavior,Murmur3Hash
is still the default option when inferring fails, and of course the inferrring will always fail in regular Spark.Now only
HiveHash
andMurmur3Hash
are supported on GPU.This is hard to test by unit tests or integration tests. Since the customized Spark distribution is not public and it does not have any side effects to the regular Spark. But the CI at least can ensure no regressions and we already verify it by customer queries with the customized Spark.