support unspill for SpillableHostBuffer #12186
base: branch-25.04
Conversation
Let's add some details on how unspill works in this PR when you get a chance. I would also like to see a benchmark that shows the performance change with this PR: what is the impact of unspill in your scenario? Mostly to understand how bad things really are today.
Hi @abellina, I have refined my PR and updated the description with how it works. Please take a look. As for the benchmark, take the example in #12184 (comment) (we need #12215 fixed to run this example, but that in turn depends on this current issue): if you limit the host memory to a relatively low threshold, you'll find it takes forever for the shuffle concat to finish, because its input pieces are frequently churning.
build
Force-pushed from 2bb50b8 to 7ea3d9b
@binmahone please see my comment here: #12184 (comment). Is there a reason why this pattern change suggestion doesn't work?
So this brings up a very interesting point. Do we want to be able to run the concat operation when we do not have enough host memory to store the entire input shuffle table and the output? My idea to stop the churn is in #12236 (comment), but it assumes that we want to have enough host memory to do the entire operation. If we have some kind of a requirement to do it with less memory, I would prefer it if we handled the split-and-retry exception and then tried to concat fewer KudoTable buffers instead.
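A hypothetical sketch of that split-and-retry fallback, using placeholder types and a placeholder exception rather than the plugin's real retry framework:

```scala
// Placeholder for the retry framework's split-and-retry signal; the real
// plugin has its own OOM/retry exception types and helpers.
class HostConcatOOM extends RuntimeException

// If concatenating every buffer at once does not fit in host memory, split
// the input in half and concatenate the smaller groups instead.
def concatWithRetry[Buf, Out](
    buffers: Seq[Buf],
    concatAll: Seq[Buf] => Out): Seq[Out] = {
  try {
    Seq(concatAll(buffers))
  } catch {
    case _: HostConcatOOM if buffers.length > 1 =>
      // Halve the batch of KudoTable-like buffers and retry each half.
      val (left, right) = buffers.splitAt(buffers.length / 2)
      concatWithRetry(left, concatAll) ++ concatWithRetry(right, concatAll)
  }
}
```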
@binmahone I think your idea of unspill and my idea of unspill are a bit different.
For me, unspill implies that we can have both a DiskHandle and a HostMemoryBuffer at the same time inside of the SpillableHostBufferHandle. So if we spill data to disk and then read it back in again, we keep a copy of it cached in host. If we go to spill and we have both disk and host, then it is simple: we just release host so it can be cleaned up, no writing to disk needed. If we run into a situation where we are in the middle of spilling and we want to materialize the buffer, then ideally we let the spill code know that it can finish writing to disk, but it should not delete the host buffer because someone grabbed it.
For me, the only downside to this approach is the added complexity in the SpillFramework code and the slightly higher probability that we might need to call spill more frequently if the lifetime of the SpillableHostBufferHandle is much longer than the lifetime of the HostMemoryBuffer that is returned by materialize.
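A hedged sketch of this design, with simplified stand-in state instead of the real SpillableHostBufferHandle fields:

```scala
// Sketch only: placeholder state and I/O, not the real SpillFramework code.
class CachedSpillSketch {
  private var host: Option[Array[Byte]] = None  // cached host copy
  private var disk: Option[String] = None       // durable disk copy (file path)

  // Reading spilled data back keeps a cached copy in host memory.
  def materialize(): Array[Byte] = synchronized {
    host.getOrElse {
      val buf = readFromDisk(disk.get)
      host = Some(buf)
      buf
    }
  }

  // With both copies present, spilling is just dropping the host copy;
  // no disk write is needed.
  def spill(): Unit = synchronized {
    (host, disk) match {
      case (Some(_), Some(_)) => host = None
      case (Some(buf), None)  => disk = Some(writeToDisk(buf)); host = None
      case _                  => // nothing in host to spill
    }
  }

  private def readFromDisk(path: String): Array[Byte] = ???  // placeholder I/O
  private def writeToDisk(buf: Array[Byte]): String = ???    // placeholder I/O
}
```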
I think the answer is NO, that's why the code pattern will be "lock all input buffers first" (check out the snapshot in #12184 (comment)); it will make sure all inputs are already in memory.
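As a rough illustration of that pattern (placeholder function types, not the plugin's actual API):

```scala
// Materialize every spillable input before the concat starts so none of the
// inputs can be spilled (and churned) mid-operation.
def concatAllInputs[Buf, Out](
    inputs: Seq[() => Buf],          // each call pins one input in host memory
    concat: Seq[Buf] => Out,
    close: Buf => Unit): Out = {
  val pinned = inputs.map(_.apply()) // every input now lives in host memory
  try {
    concat(pinned)
  } finally {
    pinned.foreach(close)            // release the pinned host copies
  }
}
```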
Sorry @revans2, I'm a little confused; it appears to me that this PR's implementation matches your description? I need some advice on the next steps for this PR.
There are a lot of places where it is not doing what I called out.
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/spill/SpillFramework.scala Lines 392 to 397 in 48a2011
If we want to keep the disk around along with the host memory, then why are we setting disk to None here? If anyone asks to "unspill" this buffer the
That code is not implemented.
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/spill/SpillFramework.scala Line 413 in 48a2011
If we have both a disk and a host, then line 413 will be false, which in turn sets thisThreadSpills to false. That also means that once we are in a situation with both host and disk, the host will never be released, because disk is used as the gate to see if it can be spilled.
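To illustrate the gating issue with simplified placeholder state (not the real fields):

```scala
// If the spill check only fires when nothing is on disk yet, a handle holding
// BOTH a host and a disk copy never releases its cached host copy.
case class HandleState(host: Option[AnyRef], disk: Option[AnyRef])

// Gate as described for the current code: spill only when disk is empty.
def spillsToday(s: HandleState): Boolean = s.host.isDefined && s.disk.isEmpty

// Gate needed once a host copy can be cached next to a disk copy: any live
// host copy is spillable (dropping it is cheap when a disk copy exists).
def spillsWithUnspill(s: HandleState): Boolean = s.host.isDefined
```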
spark-rapids/sql-plugin/src/main/scala/com/nvidia/spark/rapids/spill/SpillFramework.scala Lines 455 to 457 in 48a2011
Technically that is true about this code, but it also does not set the
As for next steps with this PR: you should retest with the changes that were made in #12236 and no unspilling, and see if this is even needed at all. My guess is that the changes there make it so there is much less churn and this is not really needed. If the performance shows that something like this would still help, then please look at the places that I pointed out above so that unspill does what I want/expect. Then we would need follow-on issues to look at all of the handles to see if there are similar changes that would be helpful for them too.
This PR closes #12184.
It only takes care of SpillableHostBuffer by enhancing SpillableHostBufferHandle. In SpillableHostBufferHandle, materialize() is replaced by materialize(unspill: Boolean = false), so that the caller can express the intention to guarantee the host buffer is unspilled to host (so that they won't pay the price of reading this buffer from disk anymore). Internally, if set to true, an unspill step is added that puts the buffer back into the host store after reading it from disk (if applicable).
The tricky part is when materialize(true) and spill() are called at the same time; to be more specific, when materialize(true) happens after spilling is set to true but the disk write has not finished. In this PR we use a flag called spillingInterrupted to indicate that the host memory buffer has been rescued; on seeing spillingInterrupted set to true, the spilling thread should abandon what has already been written to disk and abort the current spilling procedure.
In the PR, I also tried upgrading the mockito version from 3.12.4 to 4.11.0, in the hope of it fixing a static mocking problem for me. Unfortunately it didn't work out. However, I think it's still worthwhile to keep the upgrade, to keep the mockito version consistent with org.scalatestplus/mockito-4-11, which literally speaking requires mockito 4.11.
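A hedged sketch of the race handling described above; names and state are simplified stand-ins for the real handle, and the finer-grained locking the real SpillFramework needs is omitted:

```scala
import java.util.concurrent.atomic.AtomicBoolean

class SpillingHandleSketch {
  @volatile private var host: Option[Array[Byte]] = None
  @volatile private var disk: Option[String] = None
  private val spilling = new AtomicBoolean(false)
  private val spillingInterrupted = new AtomicBoolean(false)

  // Called by the spill thread.
  def spill(): Unit = {
    if (spilling.compareAndSet(false, true)) {
      host.foreach { buf =>
        val path = writeToDisk(buf)
        if (spillingInterrupted.getAndSet(false)) {
          // materialize(unspill = true) rescued the host buffer mid-spill:
          // abandon what was written and keep the host copy.
          deleteFromDisk(path)
        } else {
          disk = Some(path)
          host = None
        }
      }
      spilling.set(false)
    }
  }

  // Called by a task thread that wants the data guaranteed to stay in host memory.
  def materialize(unspill: Boolean = false): Array[Byte] = {
    host match {
      case Some(buf) =>
        if (unspill && spilling.get()) {
          spillingInterrupted.set(true) // tell the spill thread not to drop host
        }
        buf
      case None =>
        val buf = readFromDisk(disk.get)
        if (unspill) {
          host = Some(buf) // unspill: put the buffer back into the host store
        }
        buf
    }
  }

  private def writeToDisk(buf: Array[Byte]): String = ???    // placeholder I/O
  private def deleteFromDisk(path: String): Unit = ???       // placeholder I/O
  private def readFromDisk(path: String): Array[Byte] = ???  // placeholder I/O
}
```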