
Support HiveHash in GPU partitioning #12192

Open
wants to merge 4 commits into base: branch-25.04

Conversation

firestarman
Collaborator

@firestarman commented Feb 21, 2025

(No issue for this)

This PR adds support for GPU hash partitioning to use the same hash function as the CPU, by inferring the hash function type from the CPU hash partitioning.

This is designed for certain Spark distributions that allow specifying a hash function (e.g. HiveHash or Murmur3Hash) for hash partitioning via a new field named "hashingFunctionClass". This differs from vanilla Spark, which always uses Murmur3Hash. To align with vanilla Spark's behavior, Murmur3Hash remains the default when inference fails, and inference always fails on vanilla Spark.

Currently, only HiveHash and Murmur3Hash are supported on the GPU.
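
For reference, a minimal sketch of the inference idea, assuming the customized Spark exposes the hash function via a "hashingFunctionClass" field on HashPartitioning; the helper name and return type below are illustrative, not this PR's actual code:

```scala
import org.apache.spark.sql.catalyst.expressions.Murmur3Hash
import org.apache.spark.sql.catalyst.plans.physical.HashPartitioning

object HashFunctionInference {
  // Hypothetical helper: infer the CPU hash function class, if any.
  def inferHashFunctionClass(hp: HashPartitioning): Class[_] = {
    try {
      // Reflection keeps the plugin compilable against vanilla Spark,
      // where no such field exists.
      hp.getClass.getMethod("hashingFunctionClass").invoke(hp)
        .asInstanceOf[Class[_]]
    } catch {
      case _: NoSuchMethodException =>
        // Vanilla Spark path: inference "fails", so default to Murmur3Hash.
        classOf[Murmur3Hash]
    }
  }
}
```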

This is hard to cover with unit or integration tests, since the customized Spark distribution is not public and the change has no side effects on vanilla Spark. The CI can at least ensure there are no regressions, and we have already verified the change with customer queries on the customized Spark.

Signed-off-by: Firestarman <[email protected]>
@firestarman
Collaborator Author

build

1 similar comment
@firestarman
Collaborator Author

build

@pxLi
Member

pxLi commented Feb 21, 2025

CI run hit #12194

@firestarman
Collaborator Author

CI run hit #12194

Yeah, so I tried a new run.

winningsix previously approved these changes Feb 21, 2025
Collaborator

@winningsix left a comment


LGTM

@firestarman
Collaborator Author

build

logInfo(s"Found hash function '$hashMode' from cpu hash partitioning.")
} catch {
case _: NoSuchMethodException => // not the customized spark distributions, noop
logInfo("No hash function field is found in cpu hash partitioning.")
Collaborator


Nit: This is the path for vanilla Spark, but the log reads as if something is wrong.

Collaborator Author

@firestarman Feb 24, 2025


I will improve this in a follow-up. Updated.

res-life previously approved these changes Feb 24, 2025
Collaborator

@res-life left a comment


LGTM

@firestarman
Collaborator Author

build

Signed-off-by: Firestarman <[email protected]>
@firestarman dismissed stale reviews from res-life and winningsix via 3f90be0 February 24, 2025 04:12
@firestarman
Collaborator Author

build

@firestarman
Collaborator Author

build

1 similar comment
@firestarman
Collaborator Author

build

Collaborator

@res-life left a comment


LGTM

@winningsix
Collaborator

@revans2 could you help take a look at this? Thanks!

Collaborator

@revans2 left a comment


Mostly nits, but I am a bit concerned about not getting the proper test coverage for this feature.

}
logInfo(s"Found hash function '$hashMode' from CPU hash partitioning.")
Collaborator


Can we change this logging to debug? I'm not sure we need to output this most of the time.

logInfo(s"Found hash function '$hashMode' from CPU hash partitioning.")
} catch {
case _: NoSuchMethodException => // not the customized spark distributions, noop
logInfo("Use murmur3 for GPU hash partitioning.")
Collaborator


Same here. I would prefer to have this as debug logging.

val hfMeta = GpuOverrides.wrapExpr(hh, this.conf, None)
hfMeta.tagForGpu()
if (!hfMeta.canThisBeReplaced) {
  willNotWorkOnGpu(s"the hash function: ${hh.getClass.getSimpleName}" +
Collaborator


My only real concern with this patch is in the testing.

https://github.com/NVIDIA/spark-rapids/blob/branch-25.04/integration_tests/src/main/python/repart_test.py

is where we test all of our hash partitioning code. It should still work fine as long as our hash partitioning code matches what the CPU is doing. My concern is that the tests are written entirely with murmur3 hashing in mind, with the limitation that we don't support a hash key that is an Array with a Struct under it, so there are no tests for that case. The hive hash code, by contrast, does support it but has a maximum nesting limit instead. At a minimum, can we have a follow-on issue to find a proper way to let the integration tests know which hashing mode is being used, and update the integration tests for proper coverage? The key shape in question is sketched below.
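
For illustration, a spark-shell style snippet of the untested key shape (hypothetical names, just to make the gap concrete):

```scala
import org.apache.spark.sql.types._

// An array with a struct under it: unsupported as a hash key by GPU murmur3,
// supported by hive hash (up to a nesting limit), and untested either way.
val arrayOfStructKey = StructField("key",
  ArrayType(StructType(Seq(StructField("a", IntegerType)))))
```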

Collaborator


Hmm, off the top of my head, there may be two options: 1. introduce a test-only configuration that allows manually configuring HiveHash as the shuffle partitioning strategy; 2. cover the integration test with a custom Spark that allows HiveHash as the shuffle partitioning strategy. A sketch of option 1 follows.
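
For option 1, a hypothetical sketch of what such a test-only conf could look like (the key name, accepted values, and plumbing are assumptions, not an agreed design):

```scala
import org.apache.spark.SparkConf

object TestHashFunction {
  // Hypothetical test-only key; not an existing spark-rapids conf.
  val KEY = "spark.rapids.sql.test.hashFunction"

  /** Returns "murmur3" (the default) or "hive" from the Spark conf. */
  def fromConf(conf: SparkConf): String = conf.get(KEY, "murmur3") match {
    case f @ ("murmur3" | "hive") => f
    case other => throw new IllegalArgumentException(
      s"$KEY must be 'murmur3' or 'hive', got '$other'")
  }
}
```

The tests in repart_test.py could then be parameterized over this key so both hash functions get coverage.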
