
Support HiveHash in GPU partitioning #12192

Open
wants to merge 4 commits into base: branch-25.04

Conversation

firestarman
Collaborator

@firestarman commented Feb 21, 2025

(No issue for this)

This PR adds support for GPU hash partitioning to use the same hash function as the CPU, by inferring the hash function type from the CPU hash partitioning.

This is designed for certain Spark distributions that allow specifying a hash function (e.g. HiveHash or Murmur3Hash) for hash partitioning via a new field named "hashingFunctionClass". This differs from vanilla Spark, which always uses Murmur3Hash. To align with vanilla Spark's behavior, Murmur3Hash remains the default when inference fails, and inference always fails on vanilla Spark.

Currently, only HiveHash and Murmur3Hash are supported on the GPU.
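
For reference, a minimal sketch of the inference idea, assuming the customized Spark exposes the hash function via a "hashingFunctionClass" field on HashPartitioning; the helper name and return type below are illustrative, not this PR's actual code:

```scala
import org.apache.spark.sql.catalyst.expressions.Murmur3Hash
import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning

object HashFunctionInference {
  // Hypothetical helper: infer the CPU hash function class, if any.
  def inferHashFunctionClass(hp: HashPartitioning): Class[_] = {
    try {
      // Reflection keeps the plugin compilable against vanilla Spark,
      // where no such field exists.
      hp.getClass.getMethod("hashingFunctionClass").invoke(hp)
        .asInstanceOf[Class[_]]
    } catch {
      case _: NoSuchMethodException =>
        // Vanilla Spark path: inference "fails", so default to Murmur3Hash.
        classOf[Murmur3Hash]
    }
  }
}
```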

This is hard to cover with unit or integration tests, since the customized Spark distribution is not public and the change has no side effects on vanilla Spark. The CI can at least ensure there are no regressions, and we have already verified the change with customer queries on the customized Spark.

Signed-off-by: Firestarman <[email protected]>
@firestarman
Collaborator Author

build

1 similar comment
@firestarman
Collaborator Author

build

@pxLi
Member

pxLi commented Feb 21, 2025

CI run hit #12194

@firestarman
Collaborator Author

CI run hit #12194

Yeah, so I tried a new run.

winningsix previously approved these changes Feb 21, 2025
Collaborator

@winningsix left a comment


LGTM

@firestarman
Collaborator Author

build

logInfo(s"Found hash function '$hashMode' from cpu hash partitioning.")
} catch {
case _: NoSuchMethodException => // not the customized spark distributions, noop
logInfo("No hash function field is found in cpu hash partitioning.")
Collaborator


Nit: This is the path for vanilla Spark, but the log reads as if something is wrong.

Collaborator Author

@firestarman Feb 24, 2025


I will improve this in a follow-up. Updated.

res-life previously approved these changes Feb 24, 2025
Collaborator

@res-life left a comment


LGTM

@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <[email protected]>
@firestarman dismissed stale reviews from res-life and winningsix via 3f90be0 February 24, 2025 04:12
@firestarman
Collaborator Author

build

@firestarman
Collaborator Author

build

1 similar comment
@firestarman
Collaborator Author

build

Collaborator

@res-life left a comment


LGTM

@winningsix
Collaborator

@revans2 could you help take a look at this? Thanks!

Collaborator

@revans2 left a comment


Mostly nits, but I am a bit concerned about not getting the proper test coverage for this feature.

}
logInfo(s"Found hash function '$hashMode' from CPU hash partitioning.")
Collaborator


Can we change this logging to debug? I'm not sure we need to output this most of the time.

logInfo(s"Found hash function '$hashMode' from CPU hash partitioning.")
} catch {
case _: NoSuchMethodException => // not the customized spark distributions, noop
logInfo("Use murmur3 for GPU hash partitioning.")
Collaborator


Same here. I would prefer to have this as debug logging.

val hfMeta = GpuOverrides.wrapExpr(hh, this.conf, None)
hfMeta.tagForGpu()
if (!hfMeta.canThisBeReplaced) {
  willNotWorkOnGpu(s"the hash function: ${hh.getClass.getSimpleName}" +
Collaborator


My only real concern with this patch is in the testing.

https://github.com/NVIDIA/spark-rapids/blob/branch-25.04/integration_tests/src/main/python/repart_test.py

is where we test all of our hash partitioning code. It should still work fine as long as our hash partitioning code matches what the CPU is doing. My concern is that the tests are written entirely with murmur3 hashing in mind, with the limitation that we don't support a hash key that is an Array with a Struct under it, so there are no tests for that case. The hive hash code, by contrast, does support it but has a maximum nesting limit instead. At a minimum, can we have a follow-on issue to find a proper way to let the integration tests know which hashing mode is being used, and update the integration tests for proper coverage? The key shape in question is sketched below.
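
For illustration, a spark-shell style snippet of the untested key shape (hypothetical names, just to make the gap concrete):

```scala
import org.apache.spark.sql.types._

// An array with a struct under it: unsupported as a hash key by GPU murmur3,
// supported by hive hash (up to a nesting limit), and untested either way.
val arrayOfStructKey = StructField("key",
  ArrayType(StructType(Seq(StructField("a", IntegerType)))))
```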

Collaborator


Hmm, off the top of my head, there may be two options: 1. introduce a test-only configuration that allows manually configuring HiveHash as the shuffle partitioning strategy; 2. cover the integration test with a custom Spark that allows HiveHash as the shuffle partitioning strategy. A sketch of option 1 follows.
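
For option 1, a hypothetical sketch of what such a test-only conf could look like (the key name, accepted values, and plumbing are assumptions, not an agreed design):

```scala
import org.apache.spark.SparkConf

object TestHashFunction {
  // Hypothetical test-only key; not an existing spark-rapids conf.
  val KEY = "spark.rapids.sql.test.hashFunction"

  /** Returns "murmur3" (the default) or "hive" from the Spark conf. */
  def fromConf(conf: SparkConf): String = conf.get(KEY, "murmur3") match {
    case f @ ("murmur3" | "hive") => f
    case other => throw new IllegalArgumentException(
      s"$KEY must be 'murmur3' or 'hive', got '$other'")
  }
}
```

The tests in repart_test.py could then be parameterized over this key so both hash functions get coverage.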
