Make two improvements to the cache of remote data #26923
This PR makes two improvements to the runtime code supporting `--cache-remote` with `CHPL_COMM=ofi`.
Freeing nonblocking descriptors more eagerly
The issue is that the `--cache-remote` code does non-blocking PUTs (generally speaking), and the ofi comm layer allocates some memory for each nonblocking handle. There is code to free a handle in `do_wait_for`, but this code only frees one handle per trip through the `while` loop, and before this PR the loop stopped as soon as it saw a pending request with a later sequence number, regardless of whether the comm layer had already completed that request. That behavior could leave many comm handles for completed operations stacked up, waiting to be freed. This PR adjusts the loop so that it keeps running and continues freeing handles for requests that the comm layer has already completed.
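For illustration only, here is a minimal C sketch of the shape of this change. The type and helper names (`nb_handle_t`, `handle_is_complete`, `free_handle`, `seqn`) are hypothetical and do not match the actual ofi comm layer code; the point is just that the loop no longer stops at the first handle with a later sequence number and instead keeps freeing any handles whose operations are already done.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical nonblocking-handle type; the real ofi comm layer differs. */
typedef struct nb_handle {
  uint64_t seqn;            /* sequence number of the nonblocking op */
  bool complete;            /* has the comm layer finished this op? */
  struct nb_handle* next;   /* pending-handle list */
} nb_handle_t;

static bool handle_is_complete(nb_handle_t* h) { return h->complete; }
static void free_handle(nb_handle_t* h) { (void) h; /* release comm-layer memory */ }

/* Before: the loop stopped at the first handle with a later sequence number,
 * even if the comm layer had already completed it, so completed handles could
 * pile up behind it.  After (sketched here): walk the list and free every
 * handle that is already complete, stopping only at one that is genuinely
 * still pending and beyond the sequence number we are waiting for. */
static void reap_completed_handles(nb_handle_t** list, uint64_t upto_seqn) {
  nb_handle_t** cur = list;
  while (*cur != NULL) {
    nb_handle_t* h = *cur;
    if (handle_is_complete(h)) {
      *cur = h->next;        /* unlink and free eagerly */
      free_handle(h);
    } else if (h->seqn > upto_seqn) {
      break;                 /* genuinely pending and past what we need */
    } else {
      cur = &h->next;        /* still pending; keep scanning */
    }
  }
}
```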
Thread fences
In separate work, I had some code using a per-pthread data structure that is shared by many tasks, and I was using a similar "lock"-style strategy as in the cache from PR #16885. I ran into problems where multiple tasks seemed to be accessing the same pthread-local state, but I was able to fix that by adding acquire and release thread fences. On x86, I don't think these fences change what the hardware does (because of Total Store Order), but they do constrain what compiler optimization can do. I think we have just gotten lucky in terms of not yet being bitten by compiler optimization breaking these assumptions for `--cache-remote`. So, this PR adds an acquire fence when reading the "lock", unless I'm pretty sure the code already holds it (this applies to a bunch of asserts and other checking); it adds an acquire fence when acquiring the lock; and it adds a release fence when releasing the lock. These fences should prevent the compiler or processor from reordering operations in a way that breaks the "lock", even though a fence by itself does not synchronize between threads.
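As a rough illustration (not the actual cache code), here is what the fence placement looks like with C11 atomics. The `cache_state` struct, its `locked` flag, and the helper names are hypothetical; the real cache uses its own per-pthread structures and the runtime's atomic wrappers, and its "lock" flag is only touched by tasks running on the same pthread, so the sketch is only meant to show where the acquire and release fences sit relative to reading, taking, and dropping the lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-pthread structure with a simple "lock" flag. */
struct cache_state {
  _Atomic bool locked;
  /* ... cached remote data ... */
};

/* Reading the "lock" (e.g. in asserts): acquire fence so later reads of the
 * protected state are not hoisted above this check by the compiler. */
static bool cache_is_locked(struct cache_state* s) {
  bool held = atomic_load_explicit(&s->locked, memory_order_relaxed);
  atomic_thread_fence(memory_order_acquire);
  return held;
}

/* Acquiring the "lock": acquire fence so accesses to the protected state
 * stay after the point where the lock is observed as taken. */
static void cache_lock(struct cache_state* s) {
  while (atomic_exchange_explicit(&s->locked, true, memory_order_relaxed)) {
    /* spin; a real implementation would yield the task instead */
  }
  atomic_thread_fence(memory_order_acquire);
}

/* Releasing the "lock": release fence so writes to the protected state are
 * not sunk below the store that drops the lock. */
static void cache_unlock(struct cache_state* s) {
  atomic_thread_fence(memory_order_release);
  atomic_store_explicit(&s->locked, false, memory_order_relaxed);
}
```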