
Tweak CArena Defragmentation Strategy #4531


Open
wants to merge 1 commit into development

Conversation

@WeiqunZhang (Member) commented Jun 27, 2025

The previous strategy, added in #4451, has a flaw. Suppose a CArena's initial size is small and we have n vectors, each of size x. Now we resize these vectors one by one to size x+y, where y << x. Each resize requests x+y bytes and the old strategy folds the roughly x bytes of freed-but-unusable space into the new system allocation, so we end up with n new allocations, each of size 2*x+y. In the end we have roughly doubled the memory usage, because the unused spaces cannot be combined with one another.

In the new strategy, we only attempt to combine allocations when the combined amount is not less than the requested allocation size.

We also check the malloc error code now. If the allocation fails, we try to free more memory and call malloc again.
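
To illustrate the last point, a minimal sketch of the retry-on-failure idea (not the actual CArena code; release_unused_chunks is a hypothetical stand-in for the arena returning cached, unused memory to the system):

#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for the arena freeing its cached, unused chunks.
void release_unused_chunks () { /* ... */ }

// Check whether the allocation failed; if so, free more memory and call malloc again.
void* allocate_with_retry (std::size_t nbytes)
{
    void* p = std::malloc(nbytes);
    if (p == nullptr) {
        release_unused_chunks();    // try to free more memory
        p = std::malloc(nbytes);    // second attempt
    }
    return p;                       // may still be nullptr if memory is truly exhausted
}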

@AlexanderSinn (Member)

Pseudocode of the operation:

x = // big
y = // small, y << x

for (i = 0; i < n; ++i) {
    p[i] = alloc(x);
}

for (i = 0; i < n; ++i) {
    // like a vector resize: allocate the larger buffer first, then free the old one
    q[i] = alloc(x + y);
    free(p[i]);
}
// test memory here
for (i = 0; i < n; ++i) {
    free(q[i]);
}

What I think the various options should end up with:
No defrag:

[x] * n, [x+y] * n

Current defrag:

[x], [x+y], [2*x+y] * (n-1)

This PR:

[x] * 2, [2*x] * (n/2-1), [x+y] * (n/2+1)

Current defrag, but with const std::size_t N = std::max(m_hunk, std::max(freed_bytes, nbytes));:

[x], [x+y] * n

So I think the last option should be best.
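
A runnable sketch of the pseudocode above, written against AMReX's default device arena via amrex::The_Arena(); the sizes and buffer count are placeholder values, not taken from any particular run, and amrex::Arena::PrintUsage() is assumed available for the "test memory here" step:

#include <AMReX.H>
#include <AMReX_Arena.H>
#include <cstddef>
#include <vector>

void run_resize_pattern ()
{
    const std::size_t x = std::size_t(1) << 28;  // "big" size   (placeholder)
    const std::size_t y = std::size_t(1) << 16;  // "small" size (placeholder)
    const int n = 16;                            // number of buffers (placeholder)

    amrex::Arena* arena = amrex::The_Arena();
    std::vector<void*> p(n);

    for (auto& buf : p) { buf = arena->alloc(x); }   // n buffers of size x

    for (auto& buf : p) {                            // like a vector resize:
        void* bigger = arena->alloc(x + y);          // grow to x+y first,
        arena->free(buf);                            // then free the old buffer
        buf = bigger;
    }

    amrex::Arena::PrintUsage();                      // "test memory here"

    for (auto& buf : p) { arena->free(buf); }
}

int main (int argc, char* argv[])
{
    amrex::Initialize(argc, argv);
    run_resize_pattern();
    amrex::Finalize();
}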

@ax3l (Member) commented Jun 28, 2025

WarpX benchmarks on AMD MI300A (128GB) with amrex.the_arena_init_size=1, run on 4 GPUs with 8 boxes per GPU, 1 particle species with 8 ppc, and a warm, homogeneous plasma.
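
For reference, the arena settings exercised in these runs would appear in the AMReX/WarpX inputs roughly as follows (a sketch, not the full input deck):

amrex.the_arena_init_size = 1             # start the device arena at 1 byte so it grows only on demand
# amrex.the_arena_release_threshold = 0   # added only for the comparison run further below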

development

Total GPU global memory (MB) spread across MPI: [131072 ... 131072]
Free  GPU global memory (MB) spread across MPI: [50917 ... 70859]
[The         Arena] max space (MB) allocated spread across MPI: [58791 ... 79206]
[The         Arena] max space (MB) used      spread across MPI: [40480 ... 40540]
[The Managed Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The Managed Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The   Comms Arena] max space (MB) allocated spread across MPI: [284 ... 284]
[The   Comms Arena] max space (MB) used      spread across MPI: [127 ... 127]

83GB peak used (other 3 ranks show 80GB, 62GB, and 68GB peak).
calls to hipMalloc: 126

This PR

Total GPU global memory (MB) spread across MPI: [131072 ... 131072]
Free  GPU global memory (MB) spread across MPI: [54541 ... 73431]
[The         Arena] max space (MB) allocated spread across MPI: [56231 ... 74598]
[The         Arena] max space (MB) used      spread across MPI: [40480 ... 40540]
[The Managed Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The Managed Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The   Comms Arena] max space (MB) allocated spread across MPI: [284 ... 284]
[The   Comms Arena] max space (MB) used      spread across MPI: [127 ... 127]

78.5GB peak used (other 3 ranks show 76GB, 59GB, and 67GB peak).
calls to hipMalloc: 151

development with Alex's 1-line patch

Total GPU global memory (MB) spread across MPI: [131072 ... 131072]
Free  GPU global memory (MB) spread across MPI: [72057 ... 79177]
[The         Arena] max space (MB) allocated spread across MPI: [48935 ... 56551]
[The         Arena] max space (MB) used      spread across MPI: [40480 ... 40540]
[The Managed Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The Managed Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The   Comms Arena] max space (MB) allocated spread across MPI: [284 ... 284]
[The   Comms Arena] max space (MB) used      spread across MPI: [127 ... 127]

63GB peak used, though memory grows much more slowly than in the runs above (other 3 ranks show 60.7GB, 55GB, and 56.5GB peak).
calls to hipMalloc: 358

@ax3l (Member) commented Jun 28, 2025

In contrast, adding amrex.the_arena_release_threshold = 0 to development (init size 1 as above) gives:

Total GPU global memory (MB) spread across MPI: [131072 ... 131072]
Free  GPU global memory (MB) spread across MPI: [90153 ... 90185]
[The         Arena] max space (MB) allocated spread across MPI: [40483 ... 40547]
[The         Arena] max space (MB) used      spread across MPI: [40480 ... 40540]
[The Managed Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The Managed Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The  Pinned Arena] max space (MB) allocated spread across MPI: [8 ... 8]
[The  Pinned Arena] max space (MB) used      spread across MPI: [0 ... 0]
[The   Comms Arena] max space (MB) allocated spread across MPI: [284 ... 284]
[The   Comms Arena] max space (MB) used      spread across MPI: [127 ... 127]

53GB peak used, growing more slowly than development and this PR (other 3 ranks show 57GB, 46GB (and going down again), and 54.5GB peak).
calls to hipMalloc: 4096

But this comes with a massive performance hit (about 2x).

IMHO, 20GB of overhead on top of a 40GB "real", well-distributed simulation size is still too much; using an absolute growth size instead of a relative factor might help, and/or optimizing the buffer/remote unpack routines.

@WeiqunZhang (Member, Author)

@AlexanderSinn But const std::size_t N = std::max(m_hunk, std::max(freed_bytes, nbytes)) defeats the purpose of using a memory arena. For this test (https://github.com/WeiqunZhang/amrex-devtests/tree/main/defrag), using const std::size_t N = std::max(m_hunk, std::max(freed_bytes, nbytes)) results in about 1500 calls to cudaMalloc, whereas with this PR it is only 85.

@WeiqunZhang (Member, Author)

@ax3l I thought your test ran out of memory with development.

@ax3l (Member) commented Jun 28, 2025

Yes, on 512 nodes. I am good on 1 node. This test is on one node.

@ax3l (Member) commented Jun 28, 2025

I added the number of calls to hipMalloc above.

All but the first entry use an init size of 1.

patch                                      allocs   frees   runtime   peak GB
development w/ "default arena_init_size"        9      12     65.86   103 (constant)
development                                   126     129     76.48    83
this PR                                       151     154     75.97    78.5
Alex's patch                                  358     361     92.28    63
development w/ release threshold 0           4096    3927    166.19    57
