
Tweak CArena Defragmentation Strategy #4531


Open
wants to merge 1 commit into development

Conversation

@WeiqunZhang (Member) commented Jun 27, 2025

The previous strategy, added in #4451, has a flaw. Suppose a CArena's initial size is small and we have n vectors, each of size x. Now we resize these vectors one by one to size x+y, where y << x. Each resize requests x+y bytes and the old strategy folds the roughly x bytes of freed-but-unusable space into the new system allocation, so we end up with n new allocations, each of size 2*x+y. In the end we have roughly doubled the memory usage, because the unused spaces cannot be combined with one another.

In the new strategy, we only attempt to combine allocations when the combined amount is not less than the requested allocation size.

We also check the malloc error code now. If the allocation fails, we try to free more memory and call malloc again.
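
To illustrate the last point, a minimal sketch of the retry-on-failure idea (not the actual CArena code; release_unused_chunks is a hypothetical stand-in for the arena returning cached, unused memory to the system):

#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for the arena freeing its cached, unused chunks.
void release_unused_chunks () { /* ... */ }

// Check whether the allocation failed; if so, free more memory and call malloc again.
void* allocate_with_retry (std::size_t nbytes)
{
    void* p = std::malloc(nbytes);
    if (p == nullptr) {
        release_unused_chunks();    // try to free more memory
        p = std::malloc(nbytes);    // second attempt
    }
    return p;                       // may still be nullptr if memory is truly exhausted
}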

@AlexanderSinn (Member)

Pseudocode of the operation:

x = // big
y = // small, y << x

for (i = 0; i < n; ++i) {
    p[i] = alloc(x);
}

for (i = 0; i < n; ++i) {
    // like a vector resize: allocate the larger buffer first, then free the old one
    q[i] = alloc(x + y);
    free(p[i]);
}
// test memory here
for (i = 0; i < n; ++i) {
    free(q[i]);
}

What I think the various options should end up with:
No defrag:

[x] * n, [x+y] * n

Current defrag:

[x], [x+y], [2*x+y] * (n-1)

This PR:

[x] * 2, [2*x] * (n/2-1), [x+y] * (n/2+1)

Current defrag, but with const std::size_t N = std::max(m_hunk, std::max(freed_bytes, nbytes));:

[x], [x+y] * n

So I think the last option should be best.
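
A runnable sketch of the pseudocode above, written against AMReX's default device arena via amrex::The_Arena(); the sizes and buffer count are placeholder values, not taken from any particular run, and amrex::Arena::PrintUsage() is assumed available for the "test memory here" step:

#include <AMReX.H>
#include <AMReX_Arena.H>
#include <cstddef>
#include <vector>

void run_resize_pattern ()
{
    const std::size_t x = std::size_t(1) << 28;  // "big" size   (placeholder)
    const std::size_t y = std::size_t(1) << 16;  // "small" size (placeholder)
    const int n = 16;                            // number of buffers (placeholder)

    amrex::Arena* arena = amrex::The_Arena();
    std::vector<void*> p(n);

    for (auto& buf : p) { buf = arena->alloc(x); }   // n buffers of size x

    for (auto& buf : p) {                            // like a vector resize:
        void* bigger = arena->alloc(x + y);          // grow to x+y first,
        arena->free(buf);                            // then free the old buffer
        buf = bigger;
    }

    amrex::Arena::PrintUsage();                      // "test memory here"

    for (auto& buf : p) { arena->free(buf); }
}

int main (int argc, char* argv[])
{
    amrex::Initialize(argc, argv);
    run_resize_pattern();
    amrex::Finalize();
}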

@ax3l (Member) commented Jun 28, 2025

WarpX benchmarks on AMD MI300A (128GB) with amrex.the_arena_init_size=1, run on 4 GPUs with 8 boxes per GPU, 1 particle species with 8 ppc, and a warm, homogeneous plasma.
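
For reference, the arena settings exercised in these runs would appear in the AMReX/WarpX inputs roughly as follows (a sketch, not the full input deck):

amrex.the_arena_init_size = 1             # start the device arena at 1 byte so it grows only on demand
# amrex.the_arena_release_threshold = 0   # added only for the comparison run further below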

development

Total GPU global memory (MB) spread across MPI: [131072 ... 131072]
Free  GPU global memory (MB) spread across MPI: [50917 ... 70859]
[The         Arena] max space (MB) allocated spread across MPI: [58791 ... 79206]
[The         Arena] max space (MB) used      spread across MPI: [40480 ... 40540]
[The Managed Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The Managed Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The   Comms Arena] max space (MB) allocated spread across MPI: [284 ... 284]
[The   Comms Arena] max space (MB) used      spread across MPI: [127 ... 127]

83GB peak used (other 3 ranks show 80GB, 62GB, and 68GB peak).
calls to hipMalloc: 126

This PR

Total GPU global memory (MB) spread across MPI: [131072 ... 131072]
Free  GPU global memory (MB) spread across MPI: [54541 ... 73431]
[The         Arena] max space (MB) allocated spread across MPI: [56231 ... 74598]
[The         Arena] max space (MB) used      spread across MPI: [40480 ... 40540]
[The Managed Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The Managed Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The   Comms Arena] max space (MB) allocated spread across MPI: [284 ... 284]
[The   Comms Arena] max space (MB) used      spread across MPI: [127 ... 127]

78.5GB peak used (other 3 ranks show 76GB, 59GB, and 67GB peak).
calls to hipMalloc: 151

development with Alex's 1-line patch

Total GPU global memory (MB) spread across MPI: [131072 ... 131072]
Free  GPU global memory (MB) spread across MPI: [72057 ... 79177]
[The         Arena] max space (MB) allocated spread across MPI: [48935 ... 56551]
[The         Arena] max space (MB) used      spread across MPI: [40480 ... 40540]
[The Managed Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The Managed Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The   Comms Arena] max space (MB) allocated spread across MPI: [284 ... 284]
[The   Comms Arena] max space (MB) used      spread across MPI: [127 ... 127]

63GB peak used, though memory grows much more slowly than in the runs above (other 3 ranks show 60.7GB, 55GB, and 56.5GB peak).
calls to hipMalloc: 358

@ax3l (Member) commented Jun 28, 2025

In contrast, adding amrex.the_arena_release_threshold = 0 to development (init size 1 as above) gives:

Total GPU global memory (MB) spread across MPI: [131072 ... 131072]
Free  GPU global memory (MB) spread across MPI: [90153 ... 90185]
[The         Arena] max space (MB) allocated spread across MPI: [40483 ... 40547]
[The         Arena] max space (MB) used      spread across MPI: [40480 ... 40540]
[The Managed Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The Managed Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The   Comms Arena] max space (MB) allocated spread across MPI: [284 ... 284]
[The   Comms Arena] max space (MB) used      spread across MPI: [127 ... 127]

53GB peak used, growing more slowly than development and this PR (other 3 ranks show 57GB, 46GB (and going down again), and 54.5GB peak).
calls to hipMalloc: 4096

But this comes with a massive performance hit (about 2x).

IMHO, 20GB of overhead on top of a 40GB "real", well-distributed simulation size is still too much; using an absolute growth size instead of a relative factor might help, and/or optimizing the buffer/remote unpack routines.

@WeiqunZhang (Member, Author)

@AlexanderSinn But const std::size_t N = std::max(m_hunk, std::max(freed_bytes, nbytes)) defeats the purpose of using a memory arena. For this test (https://github.com/WeiqunZhang/amrex-devtests/tree/main/defrag), using const std::size_t N = std::max(m_hunk, std::max(freed_bytes, nbytes)) results in about 1500 calls to cudaMalloc, whereas with this PR it is only 85.

@WeiqunZhang (Member, Author)

@ax3l I thought your test ran out of memory with development.

@ax3l (Member) commented Jun 28, 2025

Yes, on 512 nodes. I am good on 1 node. This test is on one node.

@ax3l (Member) commented Jun 28, 2025

I added the number of calls to hipMalloc above.

All but the first entry use an init size of 1.

patch                                      allocs   frees   runtime   peak GB
development w/ "default arena_init_size"        9      12     65.86   103 (constant)
development                                   126     129     76.48    83
this PR                                       151     154     75.97    78.5
Alex's patch                                  358     361     92.28    63
development w/ release threshold 0           4096    3927    166.19    57
