Support to retry the host allocation for hybrid converters #12251

firestarman · 2025-03-04T02:12:35Z

Contributes to #8874

This PR adds retry protection to the host memory allocations made by the C2C converters in Hybrid scans. It introduces a new class, HybridHostRetryAllocator, which implements the Hybrid host allocator interface with retry support. Before each retry starts, it closes the buffers allocated so far, leaving more memory for higher-priority tasks.

This change also introduces a new trait, HostRetryAllocator, which extracts the retry code from the Hybrid-specific classes so that the new unit tests can run without loading the hybrid jar.
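The idea of closing already-allocated buffers before each retry can be illustrated with a minimal, self-contained sketch. This is not the code in this PR and does not use the plugin's retry framework: RetryAllocSketch, RetryingAllocator, Buffer, and RetryableOom are hypothetical names invented for illustration, and the raw allocator is a plain function so the retry logic has no dependency on any Hybrid class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongFunction;
import java.util.function.Supplier;

public class RetryAllocSketch {
    /** Hypothetical stand-in for a retryable host OOM signal. */
    static class RetryableOom extends RuntimeException {
        RetryableOom(String msg) { super(msg); }
    }

    /** Hypothetical buffer that only records its size and closed state. */
    static class Buffer implements AutoCloseable {
        final long size;
        boolean closed = false;
        Buffer(long size) { this.size = size; }
        @Override public void close() { closed = true; }
    }

    /** Tracks the buffers handed out and releases them before each retry. */
    static class RetryingAllocator {
        private final LongFunction<Buffer> rawAlloc;
        private final int maxRetries;
        private final List<Buffer> live = new ArrayList<>();
        int retries = 0;

        RetryingAllocator(LongFunction<Buffer> rawAlloc, int maxRetries) {
            this.rawAlloc = rawAlloc;
            this.maxRetries = maxRetries;
        }

        /** Allocates through the raw allocator and remembers the buffer. */
        Buffer allocate(long size) {
            Buffer b = rawAlloc.apply(size);
            live.add(b);
            return b;
        }

        private void closeAll() {
            for (Buffer b : live) b.close();
            live.clear();
        }

        /** Runs body; on a retryable OOM, closes all tracked buffers and reruns it. */
        <T> T withRetry(Supplier<T> body) {
            int attempt = 0;
            while (true) {
                try {
                    T result = body.get();
                    live.clear(); // success: the caller now owns the buffers
                    return result;
                } catch (RetryableOom oom) {
                    closeAll(); // free memory before retrying or giving up
                    if (attempt >= maxRetries) throw oom;
                    attempt++;
                    retries++;
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated raw allocator: the second call fails once, then succeeds.
        RetryingAllocator alloc = new RetryingAllocator(size -> {
            if (++calls[0] == 2) throw new RetryableOom("simulated host OOM");
            return new Buffer(size);
        }, 3);
        List<Buffer> buffers = alloc.withRetry(() -> {
            List<Buffer> out = new ArrayList<>();
            out.add(alloc.allocate(16));
            out.add(alloc.allocate(16));
            return out;
        });
        System.out.println("retries=" + alloc.retries + ", buffers=" + buffers.size());
    }
}
```

In this sketch the first attempt allocates one buffer and then fails, so that buffer is closed before the second attempt allocates both buffers from scratch. Because the retry wrapper depends only on a function type, it can be unit tested without any Hybrid classes on the classpath, which mirrors the motivation given for the HostRetryAllocator trait.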

This is intended to resolve OOMs like the following:

com.nvidia.spark.rapids.jni.CpuRetryOOM: Could not complete allocation after 1000 retries
        at com.nvidia.spark.rapids.HostAlloc.alloc(HostAlloc.scala:272)
        at com.nvidia.spark.rapids.HostAlloc.allocate(HostAlloc.scala:278)
        at ai.rapids.cudf.HostMemoryBuffer.allocate(HostMemoryBuffer.java:138)
        at com.nvidia.spark.rapids.velox.VeloxBatchConverter.createVectorBuilder(VeloxBatchConverter.scala:337)
        at com.nvidia.spark.rapids.velox.VeloxBatchConverter.$anonfun$resetTargetBuffers$3(VeloxBatchConverter.scala:236)
        at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
        at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
        at com.nvidia.spark.rapids.velox.VeloxBatchConverter.resetTargetBuffers(VeloxBatchConverter.scala:232)
        at com.nvidia.spark.rapids.velox.VeloxBatchConverter.<init>(VeloxBatchConverter.scala:183)
        at com.nvidia.spark.rapids.velox.VeloxBatchConverter$.apply(VeloxBatchConverter.scala:384)

Signed-off-by: Firestarman <[email protected]>

firestarman · 2025-03-04T02:41:13Z

build

Signed-off-by: Firestarman <[email protected]>

firestarman · 2025-03-04T03:02:58Z

build

res-life · 2025-03-04T03:11:42Z

tests/pom.xml

+        <dependency>
+            <groupId>com.nvidia</groupId>
+            <artifactId>rapids-4-spark-hybrid_${scala.binary.version}</artifactId>
+            <version>${project.version}</version>


Please use the version of hybrid itself like other places:

<dependency>
    <groupId>com.nvidia</groupId>
    <artifactId>rapids-4-spark-hybrid_${scala.binary.version}</artifactId>
    <version>${spark-rapids-hybrid.version}</version>
    <scope>provided</scope>
</dependency>

Like the private jar, the hybrid jar has its own version.
During the release process, project.version is not always equal to spark-rapids-hybrid.version.

Also, should the scope be <scope>test</scope>?

Updated. I added an entry for the hybrid jar to the dependencyManagement section of the root pom to unify its version across submodules.

Signed-off-by: Firestarman <[email protected]>

firestarman · 2025-03-04T04:54:42Z

build

res-life

LGTM

Signed-off-by: Firestarman <[email protected]>

firestarman · 2025-03-04T08:12:21Z

build

sperlingxx

LGTM

Support to retry the host allocation for hybrid converters

6fa68d7

Signed-off-by: Firestarman <[email protected]>

firestarman requested a review from a team as a code owner March 4, 2025 02:12

fix a build error for scala2.13

da57860

Signed-off-by: Firestarman <[email protected]>

firestarman requested review from abellina, winningsix, sperlingxx, binmahone and res-life March 4, 2025 02:32

a small refactor

34f901f

Signed-off-by: Firestarman <[email protected]>

res-life reviewed Mar 4, 2025


firestarman added 2 commits March 4, 2025 12:02

hybrid version correctness

15ce60c

Signed-off-by: Firestarman <[email protected]>

fix a scala2.13 build error

116a1f0

Signed-off-by: Firestarman <[email protected]>

res-life previously approved these changes Mar 4, 2025


firestarman added 2 commits March 4, 2025 15:31

Refactor for tests

06f9087

Signed-off-by: Firestarman <[email protected]>

rename the test file

4c5b510

Signed-off-by: Firestarman <[email protected]>

firestarman dismissed res-life’s stale review via 4c5b510 March 4, 2025 07:41

sperlingxx approved these changes Mar 4, 2025


res-life approved these changes Mar 4, 2025

